I have several monthly reports that each look something like this:
data = [['Location 1', 11, 25, 32, 67], ['Location 2', 18, 23, 47, 70], ['Location 3', 20, 34, 28, 57], ['Location 1', 23, 35, 40, 54]]
df = pd.DataFrame(data, columns=['Location', '# of Apples', '# of Fruits', '# of Carrots', '# of Vegetables'])
Location    # of Apples  # of Fruits  # of Carrots  # of Vegetables
Location 1           11           25            32               67
Location 2           18           23            47               70
Location 3           20           34            28               57
Location 1           23           35            40               54
I need to read these reports into one file and create a table that shows, for each location and each month/report, the % of Apples (# of Apples / # of Fruits * 100) and the % of Carrots (# of Carrots / # of Vegetables * 100), looking something like this:
            January                    February
Location    % of Apples  % of Carrots  % of Apples  % of Carrots
Location 1  56.7%        59.5%         48.7%        53.8%
Location 2  78.3%        67.1%         73.5%        70.8%
Location 3  58.8%        74.1%         59.2%        72.3%
I tried using pd.pivot_table, which gave me the correct format with months going across the top, but I don't know how to calculate the percent values from here.
pivot_table = pd.pivot_table(df, values=['# of Apples', '# of Fruits', '# of Carrots', '# of Vegetables'], index=['Location'], columns=['Month'], aggfunc=np.sum)
I also tried this, which gave me the percent values but not in the correct format.
df = df.groupby(['Month', 'Location']).sum()
pivot_table['% of Apples'] = df['# of Apples'] / df['# of Fruits'] * 100
pivot_table['% of Carrots'] = df['# of Carrots'] / df['# of Vegetables'] * 100
A note about the months: they are not included in the original data, since each report covers one month. When I read the reports in, I add a column containing the file name and then replace that with the month.
Thank you!
My usual approach is to take it in steps: get the correct data first, then use pivot to change the layout (de-normalise) at the very end.
I find this useful because the more normalised the data is, the easier it is to re-use in later data manipulation.
Initial data:
import pandas as pd
data = [
    ['Jan', 'Location 1', 11, 25, 32, 67],
    ['Jan', 'Location 2', 18, 23, 47, 70],
    ['Jan', 'Location 3', 20, 34, 28, 57],
    ['Jan', 'Location 1', 23, 35, 40, 54],
    ['Feb', 'Location 1', 11, 25, 32, 67],
    ['Feb', 'Location 2', 18, 23, 47, 70],
    ['Feb', 'Location 3', 20, 34, 28, 57],
    ['Feb', 'Location 1', 23, 35, 40, 54],
]
df = pd.DataFrame(data, columns=['Month', 'Location', '# of Apples', '# of Fruits', '# of Carrots', '# of Vegetables'])
print(df)
print()
print()
Processing and transformation:
df = df.groupby(['Month', 'Location'] , as_index=False).sum()
print(df)
print()
print()
# multiply by 100 so the ratios are percentages, as the question asks
df['% Apples'] = df['# of Apples'] / df['# of Fruits'] * 100
df['% Carrots'] = df['# of Carrots'] / df['# of Vegetables'] * 100
print(df)
print()
print()
df = df.pivot(index='Location', columns='Month', values=['% Apples', '% Carrots'])
print(df)
print()
print()
https://trinket.io/python3/ab8cc6ced4
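As a rough sketch of the same steps carried through to the layout the question asks for (percentages scaled by 100, months on the top header row). The names totals, wide and month_order are illustrative, and the location labels are normalised to 'Location 1/2/3', which is an assumption about the intended data:

```python
import pandas as pd

data = [
    ['Jan', 'Location 1', 11, 25, 32, 67],
    ['Jan', 'Location 2', 18, 23, 47, 70],
    ['Jan', 'Location 3', 20, 34, 28, 57],
    ['Jan', 'Location 1', 23, 35, 40, 54],
    ['Feb', 'Location 1', 11, 25, 32, 67],
    ['Feb', 'Location 2', 18, 23, 47, 70],
    ['Feb', 'Location 3', 20, 34, 28, 57],
    ['Feb', 'Location 1', 23, 35, 40, 54],
]
columns = ['Month', 'Location', '# of Apples', '# of Fruits',
           '# of Carrots', '# of Vegetables']
df = pd.DataFrame(data, columns=columns)

# 1. Sum the raw counts per month and location (still "long" data).
totals = df.groupby(['Month', 'Location'], as_index=False).sum()

# 2. Compute the percentages from the summed counts.
totals['% of Apples'] = totals['# of Apples'] / totals['# of Fruits'] * 100
totals['% of Carrots'] = totals['# of Carrots'] / totals['# of Vegetables'] * 100

# 3. Pivot to wide form only at the very end.
wide = totals.pivot(index='Location', columns='Month',
                    values=['% of Apples', '% of Carrots'])

# 4. Move Month to the top header row and put the months in calendar order
#    (month_order is a hypothetical list; extend it for more reports).
month_order = ['Jan', 'Feb']
wide = wide.swaplevel(axis=1)[month_order]

print(wide.round(1))
```

Keeping totals in long form until step 3 means the same frame can be reused for other summaries before it is reshaped for display.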
This seems like it should be easy, but I cannot get it working.
data = {'Name':['Tom', 'nick', 'krish', 'jack', 'Tom', 'nick', 'krish', 'jack'],
'Age':[31, 46, 21, 37, 31, 46, 21, 37],
'Times':[20, 21, 19, 18, 19, 20, 20, 19]}
df = pd.DataFrame(data)
df
# basic boxplot for 'Times'
df['Times'].plot(kind='box')
# Filtered version
filt = df['Name'] == 'Tom'
df.loc[filt, 'Times'].plot(kind='box')
# comparing two columns is easy but I want to compare the same column with different row filters.
df[['Times', 'Age']].plot(kind='box')
So how do I compare these two versions of the same column side by side?
Many thanks
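You can simply pass a list of Series to plt.boxplot():
import matplotlib.pyplot as plt
# first box: all rows; second box: only Tom's rows
# (Matplotlib 3.9 renamed the labels parameter to tick_labels)
box = plt.boxplot([df['Times'], df[df['Name'] == 'Tom']['Times']],
                  labels=['all', 'Toms'])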
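A self-contained sketch of the same idea, for reference. It sets the tick labels through the Axes object rather than the labels keyword, which is a substitution made here so the snippet does not depend on the Matplotlib version; the Agg backend and the name series_to_plot are likewise assumptions for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

data = {'Name': ['Tom', 'nick', 'krish', 'jack', 'Tom', 'nick', 'krish', 'jack'],
        'Age': [31, 46, 21, 37, 31, 46, 21, 37],
        'Times': [20, 21, 19, 18, 19, 20, 20, 19]}
df = pd.DataFrame(data)

# Two row filters over the same column: everyone, and Tom only.
series_to_plot = [df['Times'], df.loc[df['Name'] == 'Tom', 'Times']]

fig, ax = plt.subplots()
box = ax.boxplot(series_to_plot)      # one box per list element
ax.set_xticklabels(['all', 'Tom'])    # label the boxes in the same order
ax.set_ylabel('Times')
# plt.show()  # uncomment when running interactively
```

Any number of filtered Series can be added to the list; each one becomes another box on the same axes.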
I compared Tom, Others, and All:
import pandas as pd
import matplotlib.pyplot as plt
data = {'Name':['Tom', 'nick', 'krish', 'jack', 'Tom', 'nick', 'krish', 'jack'],
'Age':[31, 46, 21, 37, 31, 46, 21, 37],
'Times':[20, 21, 19, 18, 19, 20, 20, 19]}
df = pd.DataFrame(data)
print(df)
df.boxplot(column='Times', by='Age')
grouped=df.groupby(['Name','Times']).any().unstack().reset_index().transpose()
df2=pd.DataFrame(grouped)
new_header = df2.iloc[0]
df2 = df2[1:]
df2.columns = new_header
df2.reset_index(inplace=True)
others = [x for x in df2.columns if x not in ('Tom', 'Times')]
all_names = [x for x in df2.columns if x not in ('Times',)]  # avoid shadowing the built-in all()
df2['Others'] = df2[others].any(axis=1)
df2['All'] = df2[all_names].any(axis=1)
print(df2.columns)
print(df2)
df2.boxplot(column='Times',by=['Others'])
df2.boxplot(column='Times',by=['Tom'])
df2.boxplot(column='Times',by=['All'])
plt.show()
A similar approach to the accepted answer, but without hardcoding the names:
import pandas as pd
import matplotlib.pyplot as plt
data = {'Name':['Tom', 'nick', 'krish', 'jack', 'Tom', 'nick', 'krish', 'jack'],
'Age':[31, 46, 21, 37, 31, 46, 21, 37],
'Times':[20, 21, 19, 18, 19, 20, 20, 19]}
df = pd.DataFrame(data)
df_list = [df["Times"]]
labels_list = ["all"]
# if you dont want all, just set them to empty list
#df_list = []
#labels_list = []
grouped_df = df.groupby("Name")
for name in grouped_df.groups.keys():
    labels_list.append(name)
    df_list.append(grouped_df.get_group(name)["Times"])
plt.boxplot(df_list, labels = labels_list)
plt.show()
{'student1': 45,
'student2': 78,
'student3': 12,
'student4': 14,
'student5': 48,
'student6': 43,
'student7': 47,
'student8': 98,
'student9': 35,
'student10': 80}
How can I convert this dict into a DataFrame?
import pandas as pd
student = {
"student1": 45,
"student2": 78,
"student3": 12,
"student4": 14,
"student5": 48,
"student6": 43,
"student7": 47,
"student8": 98,
"student9": 35,
"student10": 80,
}
df = pd.DataFrame(student.items(), columns=["name", "score"])
print(df)
        name  score
0   student1     45
1   student2     78
2   student3     12
3   student4     14
4   student5     48
5   student6     43
6   student7     47
7   student8     98
8   student9     35
9  student10     80
import pandas as pd
# initialise data as a dict of lists, where each key becomes a column
data = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}
# Create DataFrame
df = pd.DataFrame(data)
# or list of dicts
data = [{'a': 1, 'b': 2, 'c':3}, {'a':10, 'b': 20, 'c': 30}]
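If you get "ValueError: If using all scalar values, you must pass an index", wrap each value in a list first:
import pandas as pd
data = {'student1': 45, 'student2': 78, 'student3': 12, 'student4': 14, 'student5': 48, 'student6': 43, 'student7': 47, 'student8': 98, 'student9': 35, 'student10': 80}
# wrap every scalar in a one-element list so each key becomes a one-row column
for i in data.keys():
    data[i] = [data[i]]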
df = pd.DataFrame(data)
df.head()
This should do the trick:
df = pd.DataFrame(list(my_dict.items()), columns=['column1', 'column2'])
pd.DataFrame(dict_.items())
pd.DataFrame(dict_.items(), columns=['Student', 'Point'])
pd.Series(dict_, name='StudentValue')
All of these will work; note that the last one returns a Series rather than a DataFrame.
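Two further variants, sketched here on a shortened version of the dict from the question. The names by_index and as_columns are illustrative only; which layout is preferable depends on whether the student names should be the index or an ordinary column:

```python
import pandas as pd

student = {"student1": 45, "student2": 78, "student3": 12}

# Keys become the index, values go into a single "score" column.
by_index = pd.DataFrame.from_dict(student, orient="index", columns=["score"])

# Keys and values both become ordinary columns, going through a Series.
as_columns = (
    pd.Series(student, name="score")
    .rename_axis("name")   # name the index so reset_index() yields a "name" column
    .reset_index()
)

print(by_index)
print(as_columns)
```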
I would like to do a left merge (how='left') with dataframe_A on the left and dataframe_B on the right. The columns to join on are "name", "height" and "weight", where the height and weight values are allowed to differ by up to 2.
I am not using a for loop because my dataset is too big; it would take 2 days to complete.
E.g.
INPUT
dataframe_A: name: John, height: 170, weight: 70
dataframe_B: name: John, height: 172, weight: 69
OUTPUT
output_dataframe: name: John, height: 170, weight: 70, money: 100, grade: 1
I have two dataframes:
dataframe_A = pd.DataFrame({'name': ['John', 'May', 'Jane', 'Sally'],
'height': [170, 180, 160, 155],
'weight': [70, 88, 60, 65],
'money': [100, 1120, 2000, 3000]})
dataframe_B = pd.DataFrame({'name': ['John', 'May', 'Jane', 'Sally'],
'height': [172, 180, 160, 155],
'weight': [69, 88, 60, 65],
'grade': [1, 2, 3, 4]})
As a SQL SELECT statement, it would be something like:
SELECT * FROM dataframe_A LEFT JOIN dataframe_B
ON dataframe_A.name = dataframe_B.name
AND dataframe_A.height BETWEEN dataframe_B.height - 2 AND dataframe_B.height + 2
AND dataframe_A.weight BETWEEN dataframe_B.weight - 2 AND dataframe_B.weight + 2
;
But I am unsure how to express this in Python, as I am still learning:
output_dataframe = pd.merge(dataframe_A, dataframe_B, how='left', on=['name', 'height', 'weight'] + ***the range condition***)
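Use merge first and then filter by boolean indexing with Series.between. Note that the filter also drops rows of dataframe_A that have no match within the tolerance, so the result behaves like an inner join: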
df = pd.merge(dataframe_A, dataframe_B, on='name', how='left', suffixes=('','_'))
m1 = df['height'].between(df['height_'] - 2, df['height_'] + 2)
m2 = df['weight'].between(df['weight_'] - 2, df['weight_'] + 2)
df = df.loc[m1 & m2, dataframe_A.columns.tolist() + ['grade']]
print (df)
    name  height  weight  money  grade
0   John     170      70    100      1
1    May     180      88   1120      2
2   Jane     160      60   2000      3
3  Sally     155      65   3000      4
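If unmatched rows of dataframe_A should be kept, as a SQL LEFT JOIN would keep them, one possible variation is to filter the candidate pairs first and then join the surviving matches back onto dataframe_A. This is only a sketch: Sally's height in dataframe_B is changed to 160 here (an assumption, to produce a row outside the tolerance), and the names candidates, close and matches are illustrative:

```python
import pandas as pd

dataframe_A = pd.DataFrame({'name': ['John', 'May', 'Jane', 'Sally'],
                            'height': [170, 180, 160, 155],
                            'weight': [70, 88, 60, 65],
                            'money': [100, 1120, 2000, 3000]})
dataframe_B = pd.DataFrame({'name': ['John', 'May', 'Jane', 'Sally'],
                            'height': [172, 180, 160, 160],  # Sally altered: 5 away from 155
                            'weight': [69, 88, 60, 65],
                            'grade': [1, 2, 3, 4]})

# Pair rows by name, remembering each row's original position in dataframe_A.
candidates = dataframe_A.reset_index().merge(dataframe_B, on='name',
                                             suffixes=('', '_b'))

# Keep only pairs whose height and weight both differ by at most 2.
close = (
    (candidates['height'] - candidates['height_b']).abs().le(2)
    & (candidates['weight'] - candidates['weight_b']).abs().le(2)
)
matches = candidates.loc[close, ['index', 'grade']].set_index('index')

# Join back on dataframe_A's index; rows without a match get NaN for grade.
result = dataframe_A.join(matches, how='left')
print(result)
```

If several rows of dataframe_B fall within the tolerance for the same row of dataframe_A, that row appears once per match, just as it would in SQL.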
#code source
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=50,
                           n_features=6,
                           n_informative=3,
                           n_classes=2,
                           random_state=10,
                           shuffle=True)
# Creating a dataFrame
df = pd.DataFrame({'Feature 1': X[:, 0],
                   'Feature 2': X[:, 1],
                   'Feature 3': X[:, 2],
                   'Feature 4': X[:, 3],
                   'Feature 5': X[:, 4],
                   'Feature 6': X[:, 5],
                   'Class': y})
values = [i for i,x in enumerate(df['Class']) if x == 0]
print(values)
The output is
[5, 6, 9, 11, 13, 14, 17, 18, 20, 21, 23, 24, 25, 26, 27, 31, 32, 34,
41, 42, 44, 45, 46, 47, 49]
I am trying to group the above output so that runs of consecutive numbers fall into the same group. The output should look like this:
Group 1: 5,6
Group 2: 9
Group 3: 11
Group 4: 13,14
..
..
Group n: 23,24,25,26,27
I am grouping them to have an understanding of the gaps in the column, instead of having a slab of values following each other in a list.
I think you need a Series: get the differences between neighbours with diff, flag the gaps with gt(1), and turn the flags into group ids with cumsum. The resulting Series is then passed as the by argument of groupby:
values = [5, 6, 9, 11, 13, 14, 17, 18, 20, 21, 23,
24, 25, 26, 27, 31, 32, 34, 41, 42, 44, 45, 46, 47, 49]
s = pd.Series(values)
s1 = s.groupby(s.diff().gt(1).cumsum() + 1).apply(lambda x: ','.join(x.astype(str)))
print (s1)
1                 5,6
2                   9
3                  11
4               13,14
5               17,18
6               20,21
7      23,24,25,26,27
8               31,32
9                  34
10              41,42
11        44,45,46,47
12                 49
dtype: object
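For comparison, the same grouping can be sketched with only the standard library. This swaps in a different technique from the diff/cumsum approach above: in a sorted list of distinct integers, value minus position stays constant within a run of consecutive numbers, so itertools.groupby can key on that difference. The name runs is illustrative:

```python
from itertools import groupby

values = [5, 6, 9, 11, 13, 14, 17, 18, 20, 21, 23,
          24, 25, 26, 27, 31, 32, 34, 41, 42, 44, 45, 46, 47, 49]

# (position, value) pairs share the same value - position inside a consecutive run.
runs = [
    [value for _, value in group]
    for _, group in groupby(enumerate(values), key=lambda pair: pair[1] - pair[0])
]

for number, run in enumerate(runs, start=1):
    print(f"Group {number}: {','.join(map(str, run))}")
```

This assumes the input is sorted and free of duplicates, which holds for the index positions produced by enumerate in the question.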
import numpy as np
import pandas as pd
df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm', 'Budapest_PaRis', 'Brussels_londOn'],
                   'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
                   'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', '12. Air France', '"Swiss Air"']})
df
               Airline  FlightNumber           From_To  RecentDelays
0               KLM(!)       10045.0      LoNDon_paris      [23, 47]
1    <Air France> (12)           NaN      MAdrid_miLAN            []
2  (British Airways. )       10065.0  londON_StockhOlm  [24, 43, 87]
3       12. Air France           NaN    Budapest_PaRis          [13]
4          "Swiss Air"       10085.0   Brussels_londOn      [67, 32]
Some values in the FlightNumber column are missing. These numbers are meant to increase by 10 with each row, so 10055 and 10075 need to be put in place. Fill in these missing numbers and make the column an integer column (instead of a float column).
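Hopefully this will work (it assumes the default 0..n-1 index and a non-missing first value):
# walk every row after the first; count() would skip rows because it ignores NaN
for i in range(1, len(df)):
    if pd.isnull(df.loc[i, 'FlightNumber']):
        df.loc[i, 'FlightNumber'] = df.loc[i-1, 'FlightNumber'] + 10
df['FlightNumber'] = df['FlightNumber'].astype(int)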
Try this:
df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)
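A small sketch to check the interpolation on the column from the question, next to an alternative that rebuilds the sequence from the first value. The alternative assumes the first flight number is present and the step is exactly 10 on every row; the names flights, interpolated and rebuilt are illustrative:

```python
import numpy as np
import pandas as pd

flights = pd.Series([10045, np.nan, 10065, np.nan, 10085], name="FlightNumber")

# Linear interpolation fills each gap with the midpoint of its neighbours,
# which equals "previous + 10" here because the known values are 20 apart.
interpolated = flights.interpolate().astype(int)

# Alternative under the stated assumption: first value plus 10 per row.
rebuilt = pd.Series(int(flights.iloc[0]) + 10 * np.arange(len(flights)),
                    index=flights.index, name="FlightNumber")

print(interpolated.tolist())
print(rebuilt.tolist())
```

Interpolation would give different results from the rebuilt sequence if two or more missing values sat at the very end of the column, since there would be no later value to interpolate towards.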