Functions to use Python, matplotlib and pandas in Statistics [closed] - python

I have to do these things with Python, Matplotlib, and Pandas:
read a CSV file separated with "," that contains decimal values
count ALL the lines of the file
plot a bar graph with the values of the year column of the same file
find the expected value of ALL the values of a column
find the quartiles (with Python and its libraries)
find the proper sample size
What I am asking is: what are the best methods/functions to do all of these things? The only thing I have managed to write so far is this:
pd.read_csv('pandas_tutorial_read.csv', delimiter=';')
Here is a problem very similar to what I have to do.
https://www.dropbox.com/sh/sy7vqq2x2740u9d/AACFap-NPA04znDMNX5W9wdza?dl=0
Thank you!

To read in a CSV, you can use the code below. The delimiter argument is not required if the input file is comma-separated.
import pandas as pd
df = pd.read_csv('path')
To count all rows, use the shape attribute of df.
rows = df.shape[0]
To plot a bar graph, use this:
import matplotlib.pyplot as plt
plt.bar(df['col1'], df['col2'])
plt.show()
If by "expected value" you mean imputing missing values, you can use scikit-learn's Imputer; you can find the documentation online. If you simply mean the mean of a column, use df[col].mean().
Quantiles can be computed like this:
df[col].quantile([0,0.25,0.5,0.75])
I didn't understand what you mean by "sample size".
There is plenty of documentation and tutorials out there. All the best!
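Putting the pieces together, here is a minimal end-to-end sketch. The file name, the ';' delimiter, the ',' decimal mark, and the 'year' column name are assumptions taken from the question, so adjust them to your actual file.
import pandas as pd
import matplotlib.pyplot as plt

# Read the CSV (delimiter and decimal mark are assumptions; change them to match the file)
df = pd.read_csv('pandas_tutorial_read.csv', delimiter=';', decimal=',')

# Count all the rows
print('rows:', df.shape[0])

# Bar graph of how often each year appears ('year' column name is an assumption)
df['year'].value_counts().sort_index().plot(kind='bar')
plt.show()

# Expected value (sample mean) and quartiles of a column
print(df['year'].mean())
print(df['year'].quantile([0.25, 0.5, 0.75]))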

Related

CSV file: Change the position of column to row and reorganize dataset [closed]

INTRODUCTION: I have a CSV file (fossil-fuel-input.csv) that contains data on fossil fuel utilization by country and year. The layout of the data (the order of rows and columns) in this CSV is not the one I need.
DESCRIPTION: The layout in which I want the data is shown in the second CSV file (fossil-fuel-output.csv). This second CSV contains only a few records from the first CSV; I actually want all records from fossil-fuel-input.csv to be present in fossil-fuel-output.csv. Copying and pasting manually is not feasible, as it would take a lot of time.
QUESTION: How can I achieve this? Feel free to use any tool like Excel, Python, etc. (I tried VLOOKUP and Transpose in Excel, but they didn't work for me.)
You can do this with pandas by creating a pivot table:
import pandas as pd
df = pd.read_csv('https://gitlab.com/sysuin/datasets/-/raw/main/fossil-fuel-input.csv?inline=false')
df = df.pivot_table(values='Fossil Fuels (TWh)', index='Entity', columns='Year', aggfunc='first')
df.to_csv('output.csv')
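If you cannot reach that URL, the same reshaping can be tried on a small in-memory frame first; the column names come from the answer above, and the values below are made up purely for illustration.
import pandas as pd

# Tiny stand-in for fossil-fuel-input.csv (made-up values)
df = pd.DataFrame({
    'Entity': ['Albania', 'Albania', 'Brazil', 'Brazil'],
    'Year': [2019, 2020, 2019, 2020],
    'Fossil Fuels (TWh)': [5.1, 4.9, 1400.0, 1350.0],
})

# One row per country, one column per year
wide = df.pivot_table(values='Fossil Fuels (TWh)', index='Entity', columns='Year', aggfunc='first')
print(wide)
wide.to_csv('fossil-fuel-output.csv')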

Identifying consecutive declining values in a column from a data frame [closed]

I have a 278 x 2 data frame, and I want to find the rows that have 2 consecutive declining values in the second column. Here's a snippet:
I'm not sure how to approach this problem. I've searched how to identify consecutive declining values in a data frame, but so far I've only found questions that pertain to consecutive SAME values, which isn't what I'm looking for. I could iterate over the data frame, but I don't believe that's very efficient.
Also, I'm not asking for someone to show me how to code this problem. I'm simply asking for potential ways I could go about solving this problem on my own because I'm unsure of how to approach the issue.
1. Use shift to create a temporary column with all values shifted up one row.
2. Compare the two columns, "GDP" > "shift". This gives you a new column of Boolean values.
3. Look for consecutive True values in this Boolean column; that identifies two consecutive declining values (see the sketch below).
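For example, here is a minimal pandas sketch of that approach; the column name "GDP" is taken from the answer and the numbers are toy data for illustration.
import pandas as pd

df = pd.DataFrame({'GDP': [5.0, 4.8, 4.5, 4.7, 4.6, 4.2]})  # toy data

# True where the next row's value is lower than the current one
declines = df['GDP'] > df['GDP'].shift(-1)

# True where that decline is followed by a second decline
two_in_a_row = declines & declines.shift(-1, fill_value=False)

# Rows that start a run of two consecutive declining values
print(df[two_in_a_row])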

How to add data for missing values [closed]

Hello all. I have a question about how to add missing values to a dataset object.
I'm currently working on crop growth modeling and use the NASA Power API as my weather data source.
However, the NASA Power dataset has missing days.
I used the pcse library to extract the NASA Power dataset.
My question is: how do I add the missing day's data?
I tried
wdp(date) = wdp(date - timedelta(days=1))
but it gives me back "can't assign to function call".
In any case, it seems that the data for the missing date does not exist in the object and I am not allowed to create it.
You have the right idea, but the wrong syntax. In Python, list and dict access uses square brackets ([]); see the docs.
On top of that, pcse's WeatherDataProvider object does not support this style of access. Checking the code at that link, it appears there is a method named _store_WeatherDataContainer, where the leading _ indicates it is not intended for public use, but that doesn't mean you can't use it :-)
It should look like this:
from datetime import timedelta
wdp._store_WeatherDataContainer(wdp(date - timedelta(days=1)), date)

change individual linestyle when using pandas's plot [closed]

I have many dataframes (df) with varying numbers of columns; the first column is a date and the rest are the data I would like to plot. I used df.plot() to plot the lines automatically, which is simple with pandas's plot function. However, I would like to change the linewidth of, say, the first and fourth lines, or even only the first line. How can I do that in pandas? I know how to do it with matplotlib by looping over each column and plotting each line, but what about just using pandas's plot function? Thanks.
Maybe you can create a list with a fixed length (depending on your DataFrame's size):
list_of_line_width = [1] * len(df.columns)
The rest is just changing the width of the lines you are interested in:
list_of_line_width[index_position] = width_of_the_line
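One way to apply those widths is to let df.plot() draw the lines and then adjust the Line2D objects on the returned Axes; this is a sketch with a made-up frame, not the only possible approach.
import pandas as pd
import matplotlib.pyplot as plt

# Toy frame: a date index plus a few data columns (made up for illustration)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 2, 1], 'c': [2, 2, 2]},
                  index=pd.date_range('2020-01-01', periods=3))

list_of_line_width = [1] * len(df.columns)
list_of_line_width[0] = 3          # make only the first line thicker

ax = df.plot()                     # pandas draws one Line2D per column
for line, width in zip(ax.get_lines(), list_of_line_width):
    line.set_linewidth(width)
plt.show()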

How to pull variables from line of data file in Python [closed]

I have a large data file where each row looks as follows. Each pipe-delimited value represents a consistent variable (e.g. 1517892812 and 1517892086 represent the Unix timestamp, and the last pipe-delimited field will always be UnixTimestamp):
264|2|8|6|1.32235000|1.33070000|1.31400000|1257.89480966|1517892812
399|10|36|2|1.12329614|1.12659227|1.12000000|148194.47200218|1517892086
How can I pull out the values I need to make variables in Python? For example, looking at a row and getting UnixTimestamp=1517892812 (and other variables) out of it.
I want to pull out each relevant variable per line, work with them, and then look at the next line and reevaluate all of the variable values.
Is RegEx what I should be dealing with here?
No need for regex; you can use split():
int(line.strip().split('|')[-1])  # line is one row of the file
If all the fields are numeric and you want a matrix with all your values, you can simply do something like:
[[float(x) for x in line.strip().split('|')] for line in your_data.splitlines()]
You can also use regex and re.search():
import re
int(re.search(r'[^|]+$', text).group())
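Here is a slightly fuller sketch of the line-by-line approach; the file name data.txt is an assumption, and only the last field's meaning (UnixTimestamp) is known from the question, so the other fields are kept as an unnamed list.
# data.txt is assumed to contain rows like the two shown in the question
with open('data.txt') as f:
    for line in f:
        fields = line.strip().split('|')
        if not fields or fields == ['']:
            continue
        unix_timestamp = int(fields[-1])                 # last field is always the Unix timestamp
        other_values = [float(x) for x in fields[:-1]]   # remaining fields; names not given in the question
        # work with the variables for this line before moving on to the next one
        print(unix_timestamp, other_values)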
