Select columns from a csv file based on user input - python

I am new to Python and I want to write a small program that asks the user for one or more column names and plots those columns against time.
Consider the column names: "time", "c2", "c3", "c4", "c5", "c6".
The column names should be taken as user input and selected from a CSV file to plot a time series curve. However, I could not get this to work. Do you have any ideas or similar code to share?
The code I am using to plot the curves is shown below. Note that it plots all the columns in the CSV file against the time column, which is stored as epoch time and converted to human-readable time.
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 5)
plt.style.use('fivethirtyeight')
# import the csv file and convert the epoch time
df = pd.read_csv('.csv', parse_dates=['time'], date_parser=lambda epoch: pd.to_datetime(epoch, unit='s'))
print(df)
# make sure the time column is actually time format
df['time']=pd.to_datetime(df['time'])
# set time as the index
df.set_index('time',inplace=True)
df.plot(linewidth=2, fontsize=12)

Probably going to need a bit more information than this to try and help.
Are you using a web framework such as Flask or Django to draw your plots?
CSV files are pretty easy to read with column headings as field identifiers using the csv module.
https://docs.python.org/3/library/csv.html
Hopefully the answers are there for you.

The easiest way to do this would be to use the csv module and matplotlib.
Matplotlib's documentation includes a time series example, and its gallery shows the other kinds of plots the library can produce.
It is hard to recommend a method without knowing what kind of data you are working with and what needs to be done.
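As a rough illustration of the original request, here is a minimal sketch that reads a comma-separated list of column names from the user, checks them against the columns in the file, and plots only those against time. The helper names `parse_selection` and `plot_from_user_input` and the `csv_path` argument are made up for this example, and it assumes the file has a "time" column holding epoch seconds, as described in the question.

```python
import matplotlib.pyplot as plt
import pandas as pd


def parse_selection(raw, available):
    """Turn a string like 'c2, c4' into ['c2', 'c4'], rejecting unknown names."""
    requested = [name.strip() for name in raw.split(",") if name.strip()]
    unknown = [name for name in requested if name not in available]
    if unknown:
        raise ValueError(f"Unknown column(s) {unknown}; choose from {list(available)}")
    return requested


def plot_from_user_input(csv_path):
    """Hypothetical driver: csv_path is whatever file holds the data."""
    df = pd.read_csv(csv_path)
    # Assumption: 'time' holds epoch seconds, as in the question.
    df["time"] = pd.to_datetime(df["time"], unit="s")
    df = df.set_index("time")

    raw = input(f"Columns to plot, comma separated {list(df.columns)}: ")
    columns = parse_selection(raw, df.columns)

    # Plot only the requested columns against the time index.
    df[columns].plot(linewidth=2, fontsize=12)
    plt.show()
```

Keeping the parsing step separate from the plotting step makes it easy to check the user's input before any figure is drawn.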

Related

Adding file name column to Dask DataFrame

I have a data set of around 400 CSV files containing a time series of multiple variables (my CSV has a time column and then multiple columns of other variables).
My final goal is to choose some variables and plot those 400 time series in a graph.
In order to do so, I tried to use Dask to read the 400 files and then plot them.
However, from my understanding, in order to actually draw 400 time series and not a single appended data frame, I should group the data by the name of the file it came from.
Is there an efficient way in Dask to add a column to each CSV so I can later group my results by it?
Parquet files are also an option.
For example, I tried to do something like this:
import dask.dataframe as dd
import os
filenames = ['part0.parquet', 'part1.parquet', 'part2.parquet']
df = dd.read_parquet(filenames, engine='pyarrow')
df = df.assign(file=lambda x: filenames[x.index])
df_grouped = df.groupby('file')
I understand that I can use from_delayed(), but then I lose all the parallel computation.
Thank you
If you can work with CSV files, then passing the include_path_column option might be sufficient for your purpose:
from dask.dataframe import read_csv
ddf = read_csv("some_path/*.csv", include_path_column="file_path")
print(ddf.columns)
# the list of columns will include `file_path` column
There is no equivalent option for read_parquet, but something similar can be achieved with delayed. Using delayed will not remove parallelism; the code just needs to make sure that the actual calculation is done after the delayed tasks are defined.

Seaborn Pairplot with Dataframe vs CSV

I have a dataframe in a Jupyter notebook and do a pairplot on it to get a bunch of plots against each other.
import seaborn as sns
sns.pairplot(df_merge)
Here is the pairplot as a result.
However, it plots the data incorrectly and unattractively. When I export this dataframe to a CSV and then read it back into the program as a dataframe:
import pandas as pd
import seaborn as sns
df_merge.to_csv('dataframe.csv')
x = pd.read_csv('dataframe.csv')
sns.pairplot(x)
Seaborn plots it fine and the correlations between variables can be seen, but I get an unnecessary column called Unnamed which I don't need.
Does anyone know what could cause this issue and how I can go about correcting it without needing to export the dataframe as a csv?
When you do:
df_merge.to_csv('dataframe.csv')
you also write the index of df_merge, which has no name. Then
x = pd.read_csv('dataframe.csv')
reads the index as a column named Unnamed: 0. To fix this, either save the data frame without the index:
df_merge.to_csv('dataframe.csv', index=False)
x = pd.read_csv('dataframe.csv')
or read the csv with index:
df_merge.to_csv('dataframe.csv')
x = pd.read_csv('dataframe.csv', index_col=[0])
I figured out that the issue was a dtype difference: after writing the dataframe to a CSV and reading it back, the values had a float64 type, whereas in my original dataframe they were all objects. Converting all the numerical columns to float before plotting the graph solved my issue.
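A small sketch of that conversion, without the CSV round trip, might look like the following. The helper name `coerce_numeric_columns` is made up for this example; it converts a column only when every non-missing value parses as a number, so genuinely textual columns are left alone.

```python
import pandas as pd


def coerce_numeric_columns(df):
    """Return a copy in which object columns that hold numbers become numeric."""
    out = df.copy()
    for col in out.columns:
        converted = pd.to_numeric(out[col], errors="coerce")
        # Replace the column only if no value was lost in the conversion.
        if converted.notna().sum() == out[col].notna().sum():
            out[col] = converted
    return out


# Example usage with the dataframe from the question:
# sns.pairplot(coerce_numeric_columns(df_merge))
```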

Getting wrong readings when trying to plot CSV file using pandas

My csv file looks like the following:
As you can see, there are 7 comma-separated columns. I have spent hours trying to read and plot the first column, which starts with 31364, with the following code:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('test.csv', sep=',', header=None, names=['colA','colB','colC','colD','colE','colF','colG'])
y = df['colA']
plt.plot(y)
But the code outputs this plot which does not match the data at all:
I'm using Spyder with Anaconda. What could be the problem?
Are all the values in column A in the 31,000 range? You're not plotting the whole file.
Edit: I don't know what result you're looking for. In your code, the first column in your CSV is used as the index of the dataframe (after you read the CSV, enter df at the Python prompt to see what your dataset looks like).
If you don't want the first column in the CSV as an index, add index_col=False to the parameters when you read the CSV in.
Also, it is not a good idea to end lines in a CSV with the delimiter, a comma in this case.
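The effect can be reproduced with a small in-memory example. The two rows below are invented (only the leading 31364 comes from the question) and assume that each line ends with a trailing comma, so every line has one more field than there are names.

```python
import io

import pandas as pd

# Made-up rows: seven values followed by a trailing comma on each line.
raw = "31364,1,2,3,4,5,6,\n31365,7,8,9,10,11,12,\n"
names = ["colA", "colB", "colC", "colD", "colE", "colF", "colG"]

# Default: eight fields but only seven names, so pandas treats the
# first field as the index and shifts the names one column to the right.
shifted = pd.read_csv(io.StringIO(raw), header=None, names=names)

# index_col=False keeps the first field as colA and discards the
# empty field created by the trailing comma.
aligned = pd.read_csv(io.StringIO(raw), header=None, names=names, index_col=False)

print(shifted["colA"].tolist())
print(aligned["colA"].tolist())
```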

Plotting from excel to python with pandas

I am new to Python, pandas, etc., and I was asked to import and plot an Excel file. This file contains 180 rows and 15 columns, and I have to plot each column with respect to the first one, which is time, for a total of 14 different graphs. I would like some help with writing the script. Thanks in advance.
The function you are looking for is pandas.read_excel.
It will return a DataFrame object from which you can access your data in Python. Make sure your Excel file is well formatted.
import pandas as pd
# Load data
df = pd.read_excel('myfile.xlsx')
Check out these packages/ functions, you'll find some code on these websites and you can tailor it to your needs.
Some useful codes:
Read_excel
import pandas as pd
df = pd.read_excel('your_file.xlsx')
Code above reads an excel file to python and keeps it as a DataFrame, named df.
Matplotlib
import matplotlib.pyplot as plt
plt.plot(df['column - x axis'], df['column - y axis'])
plt.savefig('your_plot_image.png')
plt.show()
This is a basic example of making a plot using matplotlib and saving it as your_plot_image.png, you have to replace column - x axis and column - y axis with desired columns from your file.
For cleaning data and some basics regarding DataFrames have a look at this package: Pandas
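Putting the two pieces together, the 14 graphs could be produced with a loop along these lines. This is only a sketch: the helper name `plot_against_first_column` and the output file names are made up, and it assumes the first column of the sheet is the time column.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_against_first_column(df):
    """Create one figure per column, each plotted against the first column."""
    x_name = df.columns[0]
    figures = []
    for y_name in df.columns[1:]:
        fig, ax = plt.subplots()
        ax.plot(df[x_name], df[y_name])
        ax.set_xlabel(x_name)
        ax.set_ylabel(y_name)
        ax.set_title(f"{y_name} vs {x_name}")
        figures.append(fig)
    return figures


# Example usage (output file names are placeholders):
# df = pd.read_excel('myfile.xlsx')
# for i, fig in enumerate(plot_against_first_column(df), start=1):
#     fig.savefig(f'plot_{i}.png')
```

With 15 columns in the sheet, the function returns 14 figures, one for each column after the first.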

How to refer/assign an excel column in python?

I have a csv file (excel spreadsheet) of a column of roughly a million numbers in column A. I want to make a histogram of this data with the frequency of the numbers on the y-axis and the number quantities on the x-axis. I'm using pandas to do so. My code:
import pandas as pd
pd.read_csv('D1.csv', quoting=2)['A'].hist(bins=50)
Python isn't interpreting 'A' as the column name. I've tried various names to reference the column, but all result in a keyword error. Am I missing a step where I have to assign that column a name via Python? If so, I don't know how to do that.
I need more rep to comment, so I put this as answer.
You need to have a header row with the names you want to use in pandas. Also, if you want to see the histogram when you are working from the Python shell or IPython, you need to import pyplot:
import matplotlib.pyplot as plt
import pandas as pd
pd.read_csv('D1.csv', quoting=2)['A'].hist(bins=50)
plt.show()
Okay I finally got something to work with headings, titles, etc.
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('D1.csv', quoting=2)
data.hist(bins=50)
plt.xlim([0,115000])
plt.title("Data")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
My first problem was that matplotlib is necessary to actually show the graph, as stated by @Sauruxum. Also, I needed to assign the result of
pd.read_csv('D1.csv', quoting=2)
to data so I could plot the histogram of that action with
data.hist
Basically, the problem wasn't finding the name of the header row. The call itself needed to be .hist. Thank you all for the help.
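If the file really has no header row, one option is to name the column while reading it, which makes the original `['A']` lookup work. The sketch below uses a few made-up values in place of D1.csv; the helper name `plot_histogram` is hypothetical.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for a headerless single-column file such as D1.csv.
raw = "5\n12\n7\n12\n"

# header=None stops pandas from using the first value as the column name;
# names=["A"] gives the column the label used in the question.
data = pd.read_csv(io.StringIO(raw), header=None, names=["A"])


def plot_histogram(series, bins=50):
    """Draw a labelled histogram of one column."""
    ax = series.hist(bins=bins)
    ax.set_xlabel("Value")
    ax.set_ylabel("Frequency")
    return ax


# Example usage:
# plot_histogram(data["A"])
# plt.show()
```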
