Create a pandas dataframe from dictionary whilst maintaining order of columns - python

When creating a DataFrame as below (instructions from here), the order of the columns changes from "Day, Visitors, Bounce Rate" to "Bounce Rate, Day, Visitors":
import pandas as pd
web_stats = {'Day':[1,2,3,4,5,6],
'Visitors':[43,34,65,56,29,76],
'Bounce Rate':[65,67,78,65,45,52]}
df = pd.DataFrame(web_stats)
Gives:
   Bounce Rate  Day  Visitors
0           65    1        43
1           67    2        34
2           78    3        65
3           65    4        56
4           45    5        29
5           52    6        76
How can the order be kept intact (i.e. Day, Visitors, Bounce Rate)?

One approach is to pass the column order explicitly through the columns argument.
Example:
import pandas as pd
web_stats = {'Day':[1,2,3,4,5,6],
'Visitors':[43,34,65,56,29,76],
'Bounce Rate':[65,67,78,65,45,52]}
df = pd.DataFrame(web_stats, columns = ['Day', 'Visitors', 'Bounce Rate'])
print(df)
Output:
   Day  Visitors  Bounce Rate
0    1        43           65
1    2        34           67
2    3        65           78
3    4        56           65
4    5        29           45
5    6        76           52

Dictionaries are not considered to be ordered in Python <3.7.
You can use collections.OrderedDict instead:
from collections import OrderedDict
web_stats = OrderedDict([('Day', [1,2,3,4,5,6]),
('Visitors', [43,34,65,56,29,76]),
('Bounce Rate', [65,67,78,65,45,52])])
df = pd.DataFrame(web_stats)

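If you don't want to write out the column names, which becomes inconvenient when the dictionary has many keys, you can pass the keys themselves. This only helps when the dictionary already preserves insertion order (Python 3.7+ or an OrderedDict):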
df = pd.DataFrame(web_stats, columns = web_stats.keys())
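The following is a minimal sketch of how this behaves on a current setup. It assumes Python 3.7 or later and pandas 0.23 or later, where a plain dict keeps insertion order and the DataFrame constructor respects it; the reordered column list is only an illustration.

```python
import pandas as pd

web_stats = {
    "Day": [1, 2, 3, 4, 5, 6],
    "Visitors": [43, 34, 65, 56, 29, 76],
    "Bounce Rate": [65, 67, 78, 65, 45, 52],
}

# Assumption: Python 3.7+ and pandas 0.23+, so insertion order is kept.
df = pd.DataFrame(web_stats)
print(list(df.columns))

# Columns can also be reordered after construction by selecting
# them with a list in the wanted order.
reordered = df[["Visitors", "Day", "Bounce Rate"]]
print(list(reordered.columns))
```

On older interpreters, passing columns explicitly or using an OrderedDict, as described above, remains the safer choice.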

Related

Numpy Vectorized Window Operations

I'm interested in figuring out how to do vectorized computations in a numpy array / pandas dataframe where each new cell is updated with local information.
For example, let's say I'm a weatherman interested in making predictions about the weather. My prediction algorithm will be the mean of the past 3 days. While this prediction is simple, I'd like to be able to do this with an arbitrary function.
Example data:
day temp
1 70
2 72
3 68
4 67
...
After a transformation should become
day temp prediction
1 70 None (no previous data)
2 72 70 (only one data point)
3 68 71 (two data points)
4 67 70
5 70 69
...
I'm only interested in the prediction column, so no need to make an attempt to join the data back together after achieving the prediction! Thanks!
Use rolling with a window of 3 and min_periods of 1, then shift the result down one row so that each prediction uses only the preceding days:
df['prediction'] = df['temp'].rolling(window = 3, min_periods = 1).mean().shift()
df
day temp prediction
0 1 70 NaN
1 2 72 70
2 3 68 71
3 4 67 70
4 5 70 69
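Since the question asks for an arbitrary function rather than the mean, one possible sketch is to pass a custom function to rolling(...).apply. The function predict below is hypothetical (it just takes the median of the window) and stands in for whatever rule is actually needed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"day": [1, 2, 3, 4, 5], "temp": [70, 72, 68, 67, 70]})

def predict(window):
    # Hypothetical prediction rule: median of the values in the window.
    # With raw=True, `window` arrives as a NumPy array.
    return np.median(window)

df["prediction"] = (
    df["temp"]
    .rolling(window=3, min_periods=1)
    .apply(predict, raw=True)
    .shift()  # move results down one row so only past days are used
)
print(df)
```

A Python-level function passed to apply is called once per window, so it will generally be slower than built-in reductions such as mean.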

I need help comparing data within a table in python

I have the following table
a1b1  a1Eb1  a1b2  a1Eb2  a2b1  a2Eb1  a2b2  a2Eb2  a3b1  a3Eb1  a3b2  a3Eb2
2     20     8     54     3     56     3     67     2     78     7     75
8     30     6     67     6     35     4     56     3     85     6     74
5     54     4     64     7     23     6     48     4     67     4     82
6     65     7     53     8     27     7     35     5     25     3     64
4     34     2     52     4     28     8     27     6     94     2     29
I want to compare the following data:
a1b1 vs a1b2;
then generate arrays containing
a1b1  a1b2  minor a1b1
2     8     20
a1b2  a1b1  minor a1b2
6     8     30
and so for each row of the table
and for each of the following comparisons
a2b1 vs a2b2;
a3b1 vs a3b2;
I have tried to do it with pandas in python
import pandas as pd
import numpy as np
df = pd.DataFrame ({'a1b1':[2,8,5,6,4],
'a1Eb1':[20,30,54,65,34],
'a1b2':[8,6,4,7,2],
'a1Eb2':[54,67,64,53,52],
'a2b1':[3,6,7,8,4],
'a2Eb1':[56,35,23,27,28],
'a2b2':[3,4,6,7,8],
'a2Eb2':[67,56,48,35,27],
'a3b1':[2,3,4,5,6],
'a3Eb1':[78,85,67,25,94],
'a3b2':[7,6,4,3,2],
'a3Eb2':[75,74,82,64,29],
})
but I don't know how to go on.
Expected output
For the first row, a1b1 < a1b2, so print the following:
df1 = pd.DataFrame({'a1b1':[2],
'a1b2':[8],
'a1Eb1':[20]})
This can be a DataFrame, a list, or any other data structure.
If you want to display only specific columns of your DataFrame, put [[ and ]] after the name of the DataFrame (df) and list the names of the columns you want to see in between. It can be 2, 3 or even all of the columns, as long as each name is quoted and the names are separated by commas.
df[['a1b1','a1b2']] # to display two columns
df[['a2b1','a2b2']]
df[['a3b1','a3b2']]
To display three columns, it could for example be:
df[['a3b1','a3b2','a3Eb1']]
and so on.
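Selecting columns only shows the data; the comparison itself could be sketched with np.where. The example below assumes one reading of the question: for each row, report both values, which of the two columns is smaller, and the E value that belongs to the smaller one. The helper compare_pair and the output column names smaller and E_of_smaller are hypothetical, and the question's second example may intend a different rule.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a1b1": [2, 8, 5, 6, 4], "a1Eb1": [20, 30, 54, 65, 34],
    "a1b2": [8, 6, 4, 7, 2], "a1Eb2": [54, 67, 64, 53, 52],
})

def compare_pair(frame, left, right, left_e, right_e):
    """Hypothetical helper: compare two columns row by row."""
    left_is_smaller = frame[left] < frame[right]  # ties go to `right`
    return pd.DataFrame({
        left: frame[left],
        right: frame[right],
        # Name of the column holding the smaller value in each row.
        "smaller": np.where(left_is_smaller, left, right),
        # Assumed rule: take the E value paired with the smaller column.
        "E_of_smaller": np.where(left_is_smaller, frame[left_e], frame[right_e]),
    })

result = compare_pair(df, "a1b1", "a1b2", "a1Eb1", "a1Eb2")
print(result)
```

The same call can be repeated for the a2 and a3 pairs by passing their column names.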

How to obtain the first 4 rows for every 20 rows from a CSV file

I've read the CSV file using pandas and have managed to print the 1st, 2nd, 3rd and 4th row of every 20 rows using .iloc.
Prem_results = pd.read_csv("../data sets analysis/prem/result.csv")
Prem_results.iloc[:320:20,:]
Prem_results.iloc[1:320:20,:]
Prem_results.iloc[2:320:20,:]
Prem_results.iloc[3:320:20,:]
Is there a way, using iloc, to print the first 4 rows of every 20 lines together rather than separately as I do now? Apologies if this is worded badly; I'm fairly new to both Python and pandas.
Using groupby.head:
Prem_results.groupby(np.arange(len(Prem_results)) // 20).head(4)
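Because the question asks specifically about iloc, here is a small sketch of an iloc-based variant next to the groupby approach. It uses a made-up 100-row frame instead of the original CSV, and builds a boolean mask from each row's position within its block of 20.

```python
import numpy as np
import pandas as pd

# Stand-in data; the real frame would come from pd.read_csv.
df = pd.DataFrame({"col1": np.arange(100)})

# Position of each row inside its block of 20 rows (0..19).
position_in_block = np.arange(len(df)) % 20

# Keep rows whose position in the block is 0, 1, 2 or 3.
first_four = df.iloc[position_in_block < 4]

# The groupby approach from above, for comparison.
via_groupby = df.groupby(np.arange(len(df)) // 20).head(4)

print(first_four.head(8))
```

Both variants rely on row position rather than index labels, so they should behave the same even if the index is not a simple range.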
You can concat slices together like this:
pd.concat([df[i::20] for i in range(4)]).sort_index()
MCVE:
df = pd.DataFrame({'col1':np.arange(1000)})
pd.concat([df[i::20] for i in range(4)]).sort_index().head(20)
Output:
col1
0 0
1 1
2 2
3 3
20 20
21 21
22 22
23 23
40 40
41 41
42 42
43 43
60 60
61 61
62 62
63 63
80 80
81 81
82 82
83 83
Each slice df[i::20] starts at row i and takes every 20th row from there:
Start at 0, get every 20th row.
Start at 1, get every 20th row.
Start at 2, get every 20th row.
And start at 3, get every 20th row.
You can also do this while reading the csv itself.
df = pd.DataFrame()
for chunk in pd.read_csv(file_name, chunksize = 20):
    df = pd.concat((df, chunk.head(4)))
More resources:
You can read more about the usage of chunksize in Pandas official documentation here.
I also have a post about its usage here.

Concat function not giving desired result

I want to append a DataFrame to an Excel sheet every time the code executes, starting at the last available row in the sheet. Here is the code I am using:
import pandas as pd
from openpyxl import load_workbook
def append_df_to_excel(df, excel_path):
    df_excel = pd.read_excel(excel_path)
    result = pd.concat([df_excel, df], ignore_index=True)
    result.to_excel(excel_path)
data_set1 = {
'Name': ['Rohit', 'Mohit'],
'Roll no': ['01', '02'],
'maths': ['93', '63']}
df1 = pd.DataFrame(data_set1)
append_df_to_excel(df1, r'C:\Users\kashk\OneDrive\Documents\ScreenStocks.xlsx')
My desired output(after 3 code runs):
Rohit 1 93
Mohit 2 63
Rohit 1 93
Mohit 2 63
Rohit 1 93
Mohit 2 63
But what I get:
   Unnamed: 0.1  Unnamed: 0   Name  Roll no  maths
0             0           0  Rohit        1     93
1             1           1  Mohit        2     63
2                         2  Rohit        1     93
3                         3  Mohit        2     63
4                            Rohit        1     93
5                            Mohit        2     63
Not sure where I am going wrong.
This happens because, by default, functions such as to_excel and to_csv write the index as an extra column. When the file is read back, that column is loaded as ordinary data (named "Unnamed: 0"), so every save-and-reload cycle adds another column.
To fix it, change the line where you save your DataFrame to the file:
result.to_excel(excel_path, index=False)
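The effect can be reproduced without an Excel file. The sketch below uses to_csv and read_csv with in-memory buffers as a stand-in, on the assumption that the index default behaves the same way as in to_excel; the small frame is made up for the demonstration.

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["Rohit", "Mohit"], "maths": [93, 63]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)
print(list(reloaded.columns))

# With index=False the round trip returns only the original columns.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
clean = pd.read_csv(without_index)
print(list(clean.columns))
```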

Plot histogram using two columns (values, counts) in python dataframe

I have a dataframe having multiple columns in pairs: if one column is values then the adjacent column is the corresponding counts. I want to plot a histogram using values as x variable and counts as the frequency.
For example, I have the following columns:
Age Counts
60 1204
45 700
21 400
. .
. .
34 56
10 150
I want my code to bin the Age values in ten-year intervals between the minimum and maximum values, get the cumulative frequencies for each interval from the Counts column, and then plot a histogram. Is there a way to do this using matplotlib?
I have tried the following but in vain:
patient_dets.plot(x='PatientAge', y='PatientAgecounts', kind='hist')
(patient_dets is the dataframe with 'PatientAge' and 'PatientAgecounts' as columns)
I think you need Series.plot.bar:
patient_dets.set_index('PatientAge')['PatientAgecounts'].plot.bar()
If you need bins, one possible solution is pd.cut:
#helper df with min and max ages
df1 = pd.DataFrame({'G':['14 yo and younger','15-19','20-24','25-29','30-34',
'35-39','40-44','45-49','50-54','55-59','60-64','65+'],
'Min':[0, 15,20,25,30,35,40,45,50,55,60,65],
'Max':[14,19,24,29,34,39,44,49,54,59,64,120]})
print (df1)
                    G  Max  Min
0   14 yo and younger   14    0
1               15-19   19   15
2               20-24   24   20
3               25-29   29   25
4               30-34   34   30
5               35-39   39   35
6               40-44   44   40
7               45-49   49   45
8               50-54   54   50
9               55-59   59   55
10              60-64   64   60
11                65+  120   65
cutoff = np.hstack([np.array(df1.Min[0]), df1.Max.values])
labels = df1.G.values
patient_dets['Groups'] = pd.cut(patient_dets.PatientAge, bins=cutoff, labels=labels, right=True, include_lowest=True)
print (patient_dets)
   PatientAge  PatientAgecounts             Groups
0          60              1204              60-64
1          45               700              45-49
2          21               400              20-24
3          34                56              30-34
4          10               150  14 yo and younger
patient_dets.groupby(['PatientAge','Groups'])['PatientAgecounts'].sum().plot.bar()
You can use pd.cut() to bin your data, and then plot the summed counts with plot(kind='bar'):
nBins = 10
# passing an integer to pd.cut creates nBins equal-width bins between the min and max age
binned = pd.cut(patient_dets.Age, bins=nBins)
patient_dets.groupby(binned)['Counts'].sum().plot(kind='bar')
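For the ten-year intervals the question asks about, one possible sketch is to build the bin edges explicitly and use the counts as weights. The example uses np.histogram so that it runs without a display; matplotlib's plt.hist accepts the same bins and weights arguments. The five sample rows and the column names Age and Counts follow the question's example.

```python
import numpy as np
import pandas as pd

patient_dets = pd.DataFrame({
    "Age": [60, 45, 21, 34, 10],
    "Counts": [1204, 700, 400, 56, 150],
})

# Ten-year edges covering the data: round the minimum down and
# push the upper edge past the maximum, to the next multiple of ten.
low = (patient_dets["Age"].min() // 10) * 10
high = (patient_dets["Age"].max() // 10 + 1) * 10
edges = np.arange(low, high + 1, 10)

# Each age contributes its count instead of 1 to the bin total.
totals, _ = np.histogram(patient_dets["Age"], bins=edges,
                         weights=patient_dets["Counts"])
print(edges)
print(totals)

# To draw it with matplotlib (not run here):
# plt.hist(patient_dets["Age"], bins=edges, weights=patient_dets["Counts"])
```

Bins are half-open on the right except for the last one, so an age of exactly 20 falls in the 20 to 30 bin.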
