Finding average value in a single data frame using pandas - python

import pandas as pd

l = []
url = "ftp://ftp.cpc.ncep.noaa.gov/htdocs/degree_days/weighted/daily_data/{}/Population.Heating.txt"
for i in range(1981, 2018):
    df = pd.read_csv(url.format(i), sep=r'\|', skiprows=3, engine='python')
    l.append(df.loc[9])
print(pd.concat(l))
Region CONUS
19810101 28
19810102 29
19810103 33
19810104 37
19810105 38
19810106 33
19810107 31
19810108 36
19810109 37
19810110 36
...
20171227 37
20171228 38
20171229 35
20171230 34
20171231 40
Name: 9, Length: 13551, dtype: object
This code gives the daily temperature values from 1981 to 2017, and I am trying to find the average value for each month. pd.concat(l).mean() didn't work.
Can anyone help me with this issue? Thank you!
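A minimal sketch of one approach, assuming the concatenated series looks like the printout above (string values, a YYYYMMDD string index, and a stray 'Region'/'CONUS' entry at the top): convert the index to real datetimes, coerce the values to numbers (they come back from read_csv as strings, which is why .mean() failed), then group by month.

import pandas as pd

s = pd.concat(l)                          # the combined series built in the loop above
s = s.drop('Region', errors='ignore')     # drop the leftover 'Region'/'CONUS' entry, if present
s.index = pd.to_datetime(s.index, format='%Y%m%d')
s = pd.to_numeric(s)                      # values are strings, so .mean() needs this first

# Average across all years for each calendar month (Jan, Feb, ...):
print(s.groupby(s.index.month).mean())

# Or one average per individual month (Jan 1981, Feb 1981, ...):
print(s.resample('M').mean())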

Related

Pandas dataframe Plotly line chart with two lines

I have a pandas dataframe as below and I would like to produce a few charts with the data. The 'Acc' column holds the account names, the 'User' column is the number of users under each account, and the month columns are the login counts for each account in each month.
Acc User Jan Feb Mar Apr May June
Nora 39 5 13 16 22 14 20
Bianca 53 14 31 22 21 20 29
Anna 65 30 17 18 28 12 13
Katie 46 9 12 30 34 25 15
Melissa 29 29 12 30 10 4 9
1st: I would like to monitor the trend of logins from January to May. One line illustrates Bianca's logins and the other line illustrates everyone else's logins.
2nd: I would like to monitor the percentage change of logins from January to May. One line illustrates Bianca's login percentage change and the other line illustrates everyone else's login percentage change.
Thank you for your time and assistance. I'm a beginner at this and would appreciate any help!
I suggest the best approach to grouping is to use categoricals: defining the categories up front ("Bianca" and "Others") maps every other account to NaN, which can then be filled with "Others". pct_change is not a direct aggregate function, so it's a bit more involved to get it.
import io

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv(io.StringIO("""Acc User Jan Feb Mar Apr May June
Nora 39 5 13 16 22 14 20
Bianca 53 14 31 22 21 20 29
Anna 65 30 17 18 28 12 13
Katie 46 9 12 30 34 25 15
Melissa 29 29 12 30 10 4 9"""), sep=r"\s+")

# Set up two plot areas.
fig, ax = plt.subplots(1, 2, figsize=[20, 5])

# Divide the data into two groups: every account other than "Bianca"
# becomes NaN under the categorical, then is filled with "Others".
df["grp"] = pd.Categorical(df["Acc"], ["Bianca", "Others"])
df["grp"] = df["grp"].fillna("Others")

# The account names and user counts are no longer needed for the plots.
df.drop(columns=["User", "Acc"], inplace=True)

# Simple plot where an aggregate function (sum) exists directly;
# transpose so the months form the x-axis.
df.groupby("grp").sum().T.plot(ax=ax[0])

# A bit more involved to get the percentage change...
df.groupby("grp").sum().T.assign(
    Bianca=lambda x: x["Bianca"].pct_change().fillna(0) * 100,
    Others=lambda x: x["Others"].pct_change().fillna(0) * 100,
).plot(ax=ax[1])
Output: a figure with two line charts, total logins on the left and percentage change on the right.

How can I multiply a numpy array with pandas series?

I have a numpy array of shape (50,):
array([1.01255569e+00, 1.04166667e+00, 1.07158165e+00, 1.10229277e+00,
1.13430127e+00, 1.16387337e+00, 1.20365912e+00, 1.24007937e+00,
1.27877238e+00, 1.31856540e+00, 1.35281385e+00, 1.40291807e+00,
1.45180023e+00, 1.49700599e+00, 1.55183116e+00, 1.60051216e+00,
1.66002656e+00, 1.73370319e+00, 1.80115274e+00, 1.87687688e+00,
1.95312500e+00, 2.04750205e+00, 2.14961307e+00, 2.23613596e+00,
2.34082397e+00, 2.48015873e+00, 2.61780105e+00, 2.75027503e+00,
2.91715286e+00, 3.07881773e+00, 3.31564987e+00, 3.57142857e+00,
3.81679389e+00, 4.17362270e+00, 4.51263538e+00, 4.95049505e+00,
5.59284116e+00, 6.17283951e+00, 7.02247191e+00, 8.03858521e+00,
9.72762646e+00, 1.17370892e+01, 1.47928994e+01, 2.10084034e+01,
3.12500000e+01, 4.90196078e+01, 9.25925926e+01, 2.08333333e+02,
5.00000000e+02, 1.25000000e+03])
And I have a pandas dataframe of length 50 as well, with a single column 'x'.
x
0 9.999740e-01
1 9.981870e-01
2 9.804506e-01
3 9.187764e-01
4 8.031568e-01
5 6.544660e-01
6 5.032716e-01
7 3.707446e-01
8 2.650768e-01
9 1.857835e-01
10 1.285488e-01
11 8.824506e-02
12 6.030141e-02
13 4.111080e-02
14 2.800453e-02
15 1.907999e-02
16 1.301045e-02
17 8.882996e-03
18 6.074386e-03
19 4.161024e-03
20 2.855636e-03
21 1.963543e-03
22 1.352791e-03
23 9.338596e-04
24 6.459459e-04
25 4.476854e-04
26 3.108912e-04
27 2.163201e-04
28 1.508106e-04
29 1.053430e-04
30 7.372442e-05
31 5.169401e-05
32 3.631486e-05
33 2.555852e-05
34 1.802129e-05
35 1.272995e-05
36 9.008454e-06
37 6.386289e-06
38 4.535381e-06
39 3.226546e-06
40 2.299394e-06
41 1.641469e-06
42 1.173785e-06
43 8.407618e-07
44 6.032249e-07
45 4.335110e-07
46 3.120531e-07
47 2.249870e-07
48 1.624726e-07
49 1.175140e-07
And I want to multiply every numpy cell with the corresponding pandas cell.
Example:
1.01255569e+00 * 9.999740e-01
1.04166667e+00 * 9.981870e-01
Desired output: a numpy array of the same size.
You can just use the .values property of the 'x' series in your Pandas dataframe:
df['x'].values * arr
where df is your dataframe and arr is your array.
The above expression will return the result as a Numpy array. If you want a Pandas Series instead, you can omit the use of .values:
df['x'] * arr
Or use np.multiply, multiplying the array n with p['x'].values (where n is the array and p is the dataframe):
print(np.multiply(n, p['x'].values))
Or pd.Series.multiply:
print(np.array(p['x'].multiply(n)))
Or pd.Series.mul:
print(np.array(p['x'].mul(n)))
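For a self-contained check, a minimal runnable sketch (the three-element arrays are stand-ins for the 50-element data in the question) showing that these approaches all agree:

import numpy as np
import pandas as pd

arr = np.array([1.01255569, 1.04166667, 1.07158165])        # stand-in for the (50,) array
df = pd.DataFrame({'x': [9.999740e-01, 9.981870e-01, 9.804506e-01]})

a = df['x'].values * arr              # plain numpy array
b = (df['x'] * arr).to_numpy()        # pandas Series, converted back to an array
c = np.multiply(arr, df['x'].values)  # explicit ufunc call

assert np.allclose(a, b) and np.allclose(a, c)
print(a)  # elementwise products, same shape as the inputs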

Remove Unnamed columns in pandas dataframe [duplicate]

This question already has answers here:
How to get rid of "Unnamed: 0" column in a pandas DataFrame read in from CSV file?
(11 answers)
I have a data file with columns A-G like below, but when I read it with pd.read_csv('data.csv') it prints an extra unnamed column at the end for no reason.
colA ColB colC colD colE colF colG Unnamed: 7
44 45 26 26 40 26 46 NaN
47 16 38 47 48 22 37 NaN
19 28 36 18 40 18 46 NaN
50 14 12 33 12 44 23 NaN
39 47 16 42 33 48 38 NaN
I have looked through my data file several times, but there is no extra data in any other column. How should I remove this extra column while reading? Thanks
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
In [162]: df
Out[162]:
colA ColB colC colD colE colF colG
0 44 45 26 26 40 26 46
1 47 16 38 47 48 22 37
2 19 28 36 18 40 18 46
3 50 14 12 33 12 44 23
4 39 47 16 42 33 48 38
NOTE: very often there is only one unnamed column Unnamed: 0, which is the first column in the CSV file. This is the result of the following steps:
a DataFrame is saved into a CSV file using parameter index=True, which is the default behaviour
we read this CSV file into a DataFrame using pd.read_csv() without explicitly specifying index_col=0 (default: index_col=None)
The easiest way to get rid of this column is to specify the parameter pd.read_csv(..., index_col=0):
df = pd.read_csv('data.csv', index_col=0)
First, find the columns whose names contain 'unnamed', then drop those columns. Note the inplace=True passed to .drop:
df.drop(df.columns[df.columns.str.contains('unnamed', case=False)], axis=1, inplace=True)
The pandas.DataFrame.dropna function removes missing values (e.g. NaN, NaT).
For example, the following code removes any columns from your dataframe in which all of the elements are missing (note that dropna returns a new dataframe rather than modifying in place):
df = df.dropna(how='all', axis='columns')
The accepted solution didn't work in my case, so my solution is the following one:
''' The column name in the example case is "Unnamed: 7"
but it works with any other name ("Unnamed: 0" for example). '''
df.rename({"Unnamed: 7":"a"}, axis="columns", inplace=True)
# Then, drop the column as usual.
df.drop(["a"], axis=1, inplace=True)
Hope it helps others.
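For completeness, one more option not shown above, a sketch assuming the real data really is confined to columns A-G: pass a callable to the usecols parameter of read_csv, so the unnamed column is never read in at all.

import pandas as pd

# Keep only columns whose (parsed) names don't start with "Unnamed".
df = pd.read_csv('data.csv',
                 usecols=lambda name: not name.startswith('Unnamed'))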

Pandas sort() ignoring negative sign

I want to sort a pandas df but I'm having problems with the negative values.
import pandas as pd
df = pd.read_csv('File.txt', sep='\t', header=None)
#Suppress scientific notation (finally)
pd.set_option('display.float_format', lambda x: '%.8f' % x)
print(df)
print(df.dtypes)
print(df.shape)
b = df.sort(axis=0, ascending=True)
print(b)
This gives me the ascending order but completely disregards the sign.
SPATA1 -0.00000005
HMBOX1 0.00000005
SLC38A11 -0.00000005
RP11-571M6.17 0.00000004
GNRH1 -0.00000004
PCDHB8 -0.00000004
CXCL1 0.00000004
RP11-48B3.3 -0.00000004
RNFT2 -0.00000004
GRIK3 -0.00000004
ZNF483 0.00000004
RP11-627G18.1 0.00000003
Any ideas what I'm doing wrong?
Thanks
Loading your file with:
df = pd.read_csv('File.txt', sep='\t', header=None)
Since sort() is deprecated (it was removed in later versions of pandas), you can use sort_values:
b = df.sort_values(by=[1], axis=0, ascending=True)
where [1] is your column of values. For me this returns:
0 1
0 ACTA1 -0.582570
1 MT-CO1 -0.543877
2 CKM -0.338265
3 MT-ND1 -0.306239
5 MT-CYB -0.128241
6 PDK4 -0.119309
8 GAPDH -0.090912
9 MYH1 -0.087777
12 RP5-940J5.9 -0.074280
13 MYH2 -0.072261
16 MT-ND2 -0.052551
18 MYL1 -0.049142
19 DES -0.048289
20 ALDOA -0.047661
22 ENO3 -0.046251
23 MT-CO2 -0.043684
26 RP11-799N11.1 -0.034972
28 TNNT3 -0.032226
29 MYBPC2 -0.030861
32 TNNI2 -0.026707
33 KLHL41 -0.026669
34 SOD2 -0.026166
35 GLUL -0.026122
42 TRIM63 -0.022971
47 FLNC -0.018180
48 ATP2A1 -0.017752
49 PYGM -0.016934
55 hsa-mir-6723 -0.015859
56 MT1A -0.015110
57 LDHA -0.014955
.. ... ...
60 RP1-178F15.4 0.013383
58 HSPB1 0.014894
54 UBB 0.015874
53 MIR1282 0.016318
52 ALDH2 0.016441
51 FTL 0.016543
50 RP11-317J10.2 0.016799
46 RP11-290D2.6 0.018803
45 RRAD 0.019449
44 MYF6 0.019954
43 STAC3 0.021931
41 RP11-138I1.4 0.023031
40 MYBPC1 0.024407
39 PDLIM3 0.025442
38 ANKRD1 0.025458
37 FTH1 0.025526
36 MT-RNR2 0.025887
31 HSPB6 0.027680
30 RP11-451G4.2 0.029969
27 AC002398.12 0.033219
25 MT-RNR1 0.040741
24 TNNC1 0.042251
21 TNNT1 0.047177
17 MT-ND3 0.051963
15 MTND1P23 0.059405
14 MB 0.063896
11 MYL2 0.076358
10 MT-ND5 0.076479
7 CA3 0.100221
4 MT-ND6 0.140729
[18152 rows x 2 columns]

Efficient way to iterate through a large dataframe

I have a csv file that contains several thousand records of company stock data. It contains the following integer fields:
low_price, high_price, volume_traded
10, 20, 45667
15, 22, 256565
41, 47, 45645
30, 39, 547343
My requirement is to create a new csv file from this data by accumulating the volume_traded at every price level (from low to high). The final result would just be two columns as follows:
price, total_volume_traded
10, 45667
11, 45667
12, 45667
....
....
15, 302232
etc
In other words the final csv contains one record for every price level (not just the high/low but also the prices in-between), along with the total amount of volume_traded at that price level.
I've got this working; however, it is terribly slow and inefficient. I'm sure there must be better ways of accomplishing this.
Basically, what I've done is use nested loops:
First, iterate through each row.
On each row, create a nested loop to iterate through the price range from low_price to high_price.
Check if the price already exists in the new dataframe; if so, add the current volume_traded to it. If it doesn't exist, append the price and volume (i.e., create a new row).
Below is some of the relevant code. I would be grateful if anyone could advise a better way of doing this in terms of efficiency/speed:
df_existing = ...  # dataframe created from existing csv
df_new = ...       # dataframe for new Price/Volume values

for index, row in df_existing.iterrows():
    price = row['low_price']
    for i in range(row['low_price'], row['high_price'] + 1):
        volume = row['volume_traded']
        df_new = accumulate_volume(df_new, price, volume)
        price += 1

def accumulate_volume(df_new, price, volume):
    # If the price level already exists, add the volume to it
    if df_new['Price'].loc[df_new['Price'] == price].count() > 0:
        df_new['Volume'].loc[df_new['Price'] == price] += volume
        return df_new
    else:
        # First occurrence of this price level, so add a new row
        tmp = {'Price': int(price), 'Volume': volume}
        return df_new.append(tmp, ignore_index=True)

# Once the above finishes, df_new is written to the new csv file.
My guess for why this is so slow is at least partly because 'append' creates a new object every time it's called, and it gets called a LOT. In total, the nested loop from the above code gets run 1595653 times.
I would be very grateful for any assistance.
Let's set aside for a moment potential issues with the methodology (think about how your results would look if 100k shares traded at a price of 50-51 and 100k traded at 50-59).
Below is a set of commented steps that should achieve your goal:
import pandas as pd

# Initialize DataFrame.
df = pd.DataFrame({'low': [10, 15, 41, 30],
                   'high': [20, 22, 47, 39],
                   'volume': [45667, 256565, 45645, 547343]})

# Initialize a price dictionary spanning the range of potential prices.
d = {price: 0 for price in range(min(df.low), max(df.high) + 1)}

# Create a helper function to add volume to a given price bucket.
def add_volume(price_dict, price, volume):
    price_dict[price] += volume

# Use a nested list comprehension to call the function and populate the dictionary.
_ = [[add_volume(d, price, volume) for price in range(low, high + 1)]
     for low, high, volume in zip(df.low, df.high, df.volume)]

# Convert the dictionary to a DataFrame and output to csv.
idx = pd.Index(d.keys(), name='price')
df = pd.DataFrame(list(d.values()), index=idx, columns=['total_volume_traded'])
df.to_csv('output.csv')
>>> df
total_volume_traded
price
10 45667
11 45667
12 45667
13 45667
14 45667
15 302232
16 302232
17 302232
18 302232
19 302232
20 302232
21 256565
22 256565
23 0
24 0
25 0
26 0
27 0
28 0
29 0
30 547343
31 547343
32 547343
33 547343
34 547343
35 547343
36 547343
37 547343
38 547343
39 547343
40 0
41 45645
42 45645
43 45645
44 45645
45 45645
46 45645
47 45645
I would first groupby the 'low_price' column and sum up the volume_traded, then reset the index. This effectively accumulates all the prices of interest. Then sort by price, which makes the prices monotonic so that we can use them as the index. After setting the index, we can call reindex with a newly computed index and fill the missing values using method='pad':
In [33]:
import io
import numpy as np
temp = """low_price,high_price,volume_traded
10,20,45667
15,22,256565
41,47,45645
10,20,12345
30,39,547343"""
df = pd.read_csv(io.StringIO(temp))
df
Out[33]:
low_price high_price volume_traded
0 10 20 45667
1 15 22 256565
2 41 47 45645
3 10 20 12345
4 30 39 547343
In [34]:
df1 = df.groupby('low_price')['volume_traded'].sum().reset_index()
df1
Out[34]:
low_price volume_traded
0 10 58012
1 15 256565
2 30 547343
3 41 45645
In [36]:
df1.sort_values(['low_price']).set_index(['low_price']).reindex(index=np.arange(df1['low_price'].min(), df1['low_price'].max() + 1), method='pad')
Out[36]:
volume_traded
low_price
10 58012
11 58012
12 58012
13 58012
14 58012
15 256565
16 256565
17 256565
18 256565
19 256565
20 256565
21 256565
22 256565
23 256565
24 256565
25 256565
26 256565
27 256565
28 256565
29 256565
30 547343
31 547343
32 547343
33 547343
34 547343
35 547343
36 547343
37 547343
38 547343
39 547343
40 547343
41 45645
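
For comparison, a sketch of a fully vectorized alternative (not part of either answer above): the classic difference-array trick, where each row's volume is added at its low price and subtracted just past its high price, and a cumulative sum then yields the running totals. Column names follow the question's csv; the per-price results match the first answer's output.

import numpy as np
import pandas as pd

df = pd.DataFrame({'low_price': [10, 15, 41, 30],
                   'high_price': [20, 22, 47, 39],
                   'volume_traded': [45667, 256565, 45645, 547343]})

lo, hi = df['low_price'].min(), df['high_price'].max()
diff = np.zeros(hi - lo + 2, dtype=np.int64)

# Add each volume at its low price; remove it one step past its high price.
np.add.at(diff, df['low_price'] - lo, df['volume_traded'])
np.add.at(diff, df['high_price'] - lo + 1, -df['volume_traded'])

out = pd.Series(diff[:-1].cumsum(),
                index=pd.RangeIndex(lo, hi + 1, name='price'),
                name='total_volume_traded')
out.to_csv('output.csv', header=True)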
