python pandas add a lower level column to multi_index dataframe

Could someone help me achieve this task?
I have data in a multi-level DataFrame produced by the unstack() operation:
Original df:
Density  Length  Range  Count
15k      0.60    small    555
15k      0.60    big       17
15k      1.80    small    141
15k      1.80    big       21
15k      3.60    small    150
15k      3.60    big       26
20k      0.60    small   5543
20k      0.60    big       22
20k      1.80    small    553
20k      1.80    big       25
20k      3.60    small    422
20k      3.60    big       35
df = df.set_index(['Density','Length','Range']).unstack('Range')
# After unstack:
                Count
Range             big  small
Density Length
15k     0.60       17    555
        1.80       21    141
        3.60       26    150
20k     0.60       22   5543
        1.80       25    553
        3.60       35    422
Now I try to add an extra column at level 1: the ratio of small/big. I have tried the following syntax; none of it raises an error, but the outcomes differ:
#df[:]['ratio']=df['Count']['small']/df['Count']['big'] ## case 1. no error, no ratio
#df['Count']['ratio']=df['Count']['small']/df['Count']['big'] ## case 2. no error, no ratio
#df['ratio']=df['Count']['small']/df['Count']['big'] ## case 3. no error, ratio on column level 0
df['ratio']=df.ix[:,1]/df.ix[:,0] ## case 4. no error, ratio on column level 0
#After executing the code above, df:
                Count        ratio
Range             big  small
Density Length
15k     0.60       17    555   32.65
        1.80       21    141    6.71
        3.60       26    150    5.77
20k     0.60       22   5543  251.95
        1.80       25    553   22.12
        3.60       35    422   12.06
I don't understand why cases 1 & 2 show no error yet add no ratio column, nor why in cases 3 & 4 the ratio column ends up on column level 0 instead of the expected level 1. I would also like to know whether there is a better/more concise way to achieve this. Case 4 is the best I can do, but I don't like referring to a column by implicit (positional) indexing instead of by name.
Thanks

Case 1:
df[:]['ratio']=df['Count']['small']/df['Count']['big']
df[:] is a copy of df. They are different objects, each with its own copy of the underlying data:
In [69]: df[:] is df
Out[69]: False
So modifying the copy has no effect on the original df. Since no reference is
maintained for df[:], the object is garbage collected after the assignment,
making the assignment useless.
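A quick way to convince yourself of this (a minimal check, assuming the unstacked df from above):
tmp = df[:]
tmp['ratio'] = 1.0            # the copy gets the new column (possibly with a SettingWithCopyWarning)
'ratio' in df.columns         # False: the original df is untouched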
Case 2:
df['Count']['ratio']=df['Count']['small']/df['Count']['big']
uses chained indexing. Avoid chained indexing when making assignments; the pandas documentation on returning a view versus a copy (the URL in the warning below) explains why assignments that use chained indexing on the left-hand side may not affect df.
If you set
pd.options.mode.chained_assignment = 'warn'
then Pandas will warn you not to use chained indexing in assignments:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
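Following that advice, the same assignment can be written as a single .loc call (a sketch; whether .loc can create a brand-new column under MultiIndex columns can depend on the pandas version, so the tuple-assignment form shown below for Cases 3/4 is the more reliable route):
df.loc[:, ('Count', 'ratio')] = df['Count', 'small'] / df['Count', 'big']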
Case 3:
df['ratio']=df['Count']['small']/df['Count']['big']
and Case 4
df['ratio']=df.ix[:,1]/df.ix[:,0]
both work (note that .ix is deprecated in modern pandas; .iloc is the positional equivalent), but the same thing can be done more efficiently using
df['ratio'] = df['Count','small']/df['Count','big']
Here is a microbenchmark showing that using df[tuple_index] is faster than
chain-indexing:
In [99]: %timeit df['Count']['small']
1000 loops, best of 3: 501 µs per loop
In [128]: %timeit df['Count','small']
100000 loops, best of 3: 8.91 µs per loop
If you want ratio to be the level 1 label, then you must tell Pandas that the level 0 label is Count. You can do that by assigning to df['Count','ratio']:
In [96]: df['Count','ratio'] = df['Count','small']/df['Count','big']
In [97]: df
Out[97]:
                Count
Range             big  small       ratio
Density Length
15k     0.6        17    555   32.647059
        1.8        21    141    6.714286
        3.6        26    150    5.769231
20k     0.6        22   5543  251.954545
        1.8        25    553   22.120000
        3.6        35    422   12.057143
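Putting this together, the whole pipeline from the original long-format frame can be written concisely (a sketch; sort_index(axis=1) is optional and only keeps the column MultiIndex ordered):
df = df.set_index(['Density', 'Length', 'Range']).unstack('Range')
df['Count', 'ratio'] = df['Count', 'small'] / df['Count', 'big']
df = df.sort_index(axis=1)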

Related

pandas apply with assignment on large dataframe

I have a very long dataframe dfWeather with the column TEMP. Because of its size, I want to keep only the relevant information. Concretely, I want to keep only entries where the temperature changed by more than 1 since the last entry I kept. I want to use dfWeather.apply, since it seems to iterate over the rows much faster (10x) than a for-loop over dfWeather.iloc. I tried the following:
dfTempReduced = pd.DataFrame(columns = dfWeather.columns)
dfTempReduced.append(dfWeather.iloc[0])
dfWeather.apply(lambda x: dfTempReduced = dfTempReduced.append(x) if np.abs(TempReduced[-1].TEMP - x.TEMP) >= 1 else None, axis = 1)
unfortunately I get the error
SyntaxError: expression cannot contain assignment, perhaps you meant "=="?
Is there a fast way to get that desired result? Thanks!
EDIT:
Here is some example data
dfWeather[200:220].TEMP
Out[208]:
200 12.28
201 12.31
202 12.28
203 12.28
204 12.24
205 12.21
206 12.17
207 11.93
208 11.83
209 11.76
210 11.66
211 11.55
212 11.48
213 11.43
214 11.37
215 11.33
216 11.36
217 11.33
218 11.29
219 11.27
The desired result would keep only the first and the last entry here, since only between those two does the absolute difference exceed 1. The first entry is always included.
If you don't want this to be sequential (so that with [1, 2, 3] you would keep [1, 3], because 2 is only 1 degree larger than 1 while 3 is more than 1 degree larger than 1, although not more than 1 degree larger than 2), then you can simply use diff.
However, this doesn't work if the values stay below the 1 °C threshold for longer stretches. To overcome this limitation, you could round the values (to whatever precision, though the 1 °C threshold suggests rounding to zero decimals is a good idea).
Let us create an example:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['TEMP'] = np.random.rand(100) * 2
so now if you are OK with using diff it can be done very efficiently just by:
# either slice (diff() is NaN on the first row; keep it explicitly so the first entry survives)
d = df['TEMP'].apply(round).diff().abs()
df = df[(d > 1) | d.isna()]
# or drop the rows whose rounded temperature changed by 1 or less
d = df['TEMP'].apply(round).diff().abs()
df.drop(index=d.index[d <= 1], inplace=True)
So you have two options for the reduction. I would guess that drop takes a touch longer but is more memory efficient than the slicing approach.
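If you do need the sequential behaviour (compare each row against the last row that was actually kept, as the question describes), a plain loop over the values is a reasonable fallback; a rough sketch, assuming dfWeather with a TEMP column as in the question and its >= 1 threshold:
import numpy as np
temps = dfWeather['TEMP'].to_numpy()
keep = [0]                       # the first entry is always included
last = temps[0]
for i in range(1, len(temps)):
    if abs(temps[i] - last) >= 1:
        keep.append(i)
        last = temps[i]
dfTempReduced = dfWeather.iloc[keep]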

Why is the axis for the .mean() method in pandas the opposite in this scenario?

I have a dataframe, height_df, with three measurements, 'height_1','height_2','height_3'. I want to create a new column that has the mean of all three heights. A printout of height_df is given below
height_1 height_2 height_3
0 1.78 1.80 1.80
1 1.70 1.70 1.69
2 1.74 1.75 1.73
3 1.66 1.68 1.67
The following code works but I don't understand why
height_df['height'] = height_df[['height_1','height_2','height_3']].mean(axis=1)
I actually want the mean across the row axes, i.e. for each row compute the average of the three heights. I would have thought then that the axis argument in mean should be set to 0, as this is what corresponds to applying the mean across rows, however axis=1 is what gets the result I am looking for. Why is this? If axis=1 is for columns and axis=0 is for rows then why does .mean(axis=1) take the mean across rows?
Just tell mean to work across the columns with axis=1. The axis argument names the axis that gets collapsed by the operation: axis=1 collapses the columns and leaves one value per row (the per-row mean you want), while axis=0 collapses the rows and gives one mean per column.
df = pd.DataFrame({"height_1":[1.78,1.7,1.74,1.66],"height_2":[1.8,1.7,1.75,1.68],"height_3":[1.8,1.69,1.73,1.67]})
df = df.assign(height_mean=df.mean(axis=1))
df = df.assign(height_mean=df.loc[:,['height_1','height_2','height_3']].mean(axis=1))
print(df.to_string(index=False))
output
height_1 height_2 height_3 height_mean
1.78 1.80 1.80 1.793333
1.70 1.70 1.69 1.696667
1.74 1.75 1.73 1.740000
1.66 1.68 1.67 1.670000
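To make the axis convention concrete (a minimal illustration on the same df):
df[['height_1','height_2','height_3']].mean(axis=0)   # collapses rows: one mean per column (3 values)
df[['height_1','height_2','height_3']].mean(axis=1)   # collapses columns: one mean per row (4 values)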

Txt to csv format with rows and columns [python]

Need help converting a txt file to csv with the rows and columns intact. The text file is here:
(http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=2020&MONTH=06&FROM=2300&TO=2300&STNM=72265)
So far I only have this...
df = pd.read_csv('sounding-72265-2020010100.txt',delimiter=',')
df.to_csv('sounding-72265-2020010100.csv')
But it produces only one column, with all the other columns crammed into its rows.
Instead I want to format it into something like this:
CSV Format
Thanks for any help
I'm assuming you can start with text copied from the website; i.e. you create a data.txt file looking like the following by copy/pasting:
1000.0 8
925.0 718
909.0 872 39.6 4.6 12 5.88 80 7 321.4 340.8 322.5
900.0 964 37.6 11.6 21 9.62 75 8 320.2 351.3 322.1
883.0 1139 36.6 7.6 17 7.47 65 9 321.0 345.3 322.4
...
...
...
Then the following works, mainly based on this answer:
import pandas as pd
df = pd.read_table('data.txt', header=None, sep='\n')
df = df[0].str.strip().str.split(r'\s+', expand=True)
You read the data separating only on newlines, generating a one-column df, then use the string methods to strip and split the entries, expanding them into a new DataFrame.
You can then add the column names in as such with help from this answer:
col1 = 'PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV'.split()
col2 = 'hPa m C C % g/kg deg knot K K K '.split()
df.columns = pd.MultiIndex.from_tuples(zip(col1,col2), names = ['Variable','Unit'])
The result (df.head()):
Variable    PRES  HGHT  TEMP  DWPT  RELH  MIXR  DRCT  SKNT   THTA   THTE   THTV
Unit         hPa     m     C     C     %  g/kg   deg  knot      K      K      K
0         1000.0     8  None  None  None  None  None  None   None   None   None
1          925.0   718  None  None  None  None  None  None   None   None   None
2          909.0   872  39.6   4.6    12  5.88    80     7  321.4  340.8  322.5
3          900.0   964  37.6  11.6    21  9.62    75     8  320.2  351.3  322.1
4          883.0  1139  36.6   7.6    17  7.47    65     9  321.0  345.3  322.4
I would actually probably drop the "Unit" level of the column names if it were me, because I think MultiIndex columns can make things more complicated to slice.
Again, both reading the data and setting the column names assume you can just copy/paste them into a text file/into Python and then parse. If you are reading many pages like this, or are looking to do some sort of web scraping, that will require additional work.
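Since str.split leaves every entry as a string, you may also want to convert to numeric before writing the CSV the question asked for; a small follow-up sketch (reusing the output filename from the question):
df = df.apply(pd.to_numeric, errors='coerce')   # non-numeric / missing entries become NaN
df.to_csv('sounding-72265-2020010100.csv', index=False)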

Pandas diff SeriesGroupBy is relatively slow

Total time: 1.01876 s
Function: prepare at line 91
Line #  Hits       Time   Per Hit  % Time  Line Contents
==============================================================
    91                                              #profile
    92                                              def prepare():
    ...
    98      1      536.0     536.0     0.1              tss = df.groupby('user_id').timestamp
    99      1   949643.0  949643.0    93.2              delta = tss.diff()
    ...     (remaining lines, each well under 2% of the total time, omitted)
I have a dataframe which I group by some key and then select a column from each group and perform diff on that column (per group). As shown in the profiling results, the diff operation is super slow compared to the rest and is kind of a bottleneck. Is this expected? Are there faster alternatives to achieve the same result?
Edit: some more explanation
In my use case the timestamps represent the times of a user's actions, and I want to calculate the deltas between consecutive actions (they are sorted); each user's actions are completely independent of other users'.
Edit: Sample code
import pandas as pd
import numpy as np
df = pd.DataFrame(
{'ts':[1,2,3,4,60,61,62,63,64,150,155,156,
1,2,3,4,60,61,62,63,64,150,155,163,
1,2,3,4,60,61,62,63,64,150,155,183],
'id': [1,2,3,4,60,61,62,63,64,150,155,156,
71,72,73,74,80,81,82,83,64,160,165,166,
21,22,23,24,90,91,92,93,94,180,185,186],
'other':['x','x','x','','x','x','','x','x','','x','',
'y','y','y','','y','y','','y','y','','y','',
'z','z','z','','z','z','','z','z','','z',''],
'user':['x','x','x','x','x','x','x','x','z','x','x','y',
'y','y','y','y','y','y','y','y','x','y','y','x',
'z','z','z','z','z','z','z','z','y','z','z','z']
})
df.set_index('id',inplace=True)
deltas=df.groupby('user').ts.transform(pd.Series.diff)
If you do not wish to sort your data or drop down to numpy, then a significant performance improvement may be possible by changing your user series to Categorical. Categorical data is effectively stored as integer codes.
In the example below I see an improvement from 86 ms to roughly 59 ms (35.7 ms for the groupby plus a one-off 23.4 ms for the conversion). This may improve further for larger datasets and where more users are repeated.
df = pd.concat([df]*10000)
%timeit df.groupby('user').ts.transform(pd.Series.diff) # 86.1 ms per loop
%timeit df['user'].astype('category') # 23.4 ms per loop
df['user'] = df['user'].astype('category')
%timeit df.groupby('user').ts.transform(pd.Series.diff) # 35.7 ms per loop
If you are performing multiple operations, then the one-off cost of converting to categorical can be discounted.
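For completeness, "dropping down to numpy" as mentioned above could look roughly like this (an unbenchmarked sketch; it assumes rows are already in the desired order within each user, as in the question's sample data):
import numpy as np
import pandas as pd
codes, _ = pd.factorize(df['user'])
order = np.argsort(codes, kind='stable')       # group rows by user, preserving in-group order
ts = df['ts'].to_numpy()[order].astype('float64')
d = np.empty(len(df))
d[0] = np.nan
d[1:] = ts[1:] - ts[:-1]
new_user = np.empty(len(df), dtype=bool)
new_user[0] = True
new_user[1:] = codes[order][1:] != codes[order][:-1]
d[new_user] = np.nan                           # no delta across user boundaries
deltas = np.empty(len(df))
deltas[order] = d                              # scatter back to the original row order
result = pd.Series(deltas, index=df.index)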

Join dataframe with matrix output using pandas

I am trying to translate the input dataframe (inp_df) into the output dataframe (out_df) using data from the cell-based intermediate dataframe (matrix_df), as shown below.
There are several files, one per cell number, with the distance values shown in matrix_df.
The program iterates by cell and fetches data from the appropriate file, so each time matrix_df holds the data for all rows of the current cell# being iterated over in inp_df.
inp_df
A B cell
100 200 1
115 270 1
145 255 2
115 266 1
matrix_df (cell_1.csv)
B 100 115 199 avg_distance
200 7.5 80.7 67.8 52
270 6.8 53 92 50
266 58 84 31 57
matrix_df (cell_2.csv)
B 145 121 166 avg_distance
255 74.9 77.53 8 53.47
out_df dataframe
A B cell distance avg_distance
100 200 1 7.5 52
115 270 1 53 50
145 255 2 74.9 53.47
115 266 1 84 57
My current thought process for each cell's data is:
use an apply function to go row by row,
then use a join on column B between inp_df and matrix_df, where matrix_df is somehow translated into a tuple of column name, distance & average distance.
But I am looking for a more pandonic way of doing this, since my approach will slow down when there are millions of rows in the input. I am specifically looking for the core logic inside an iteration to fetch the matches, since in each cell the number of columns in matrix_df varies.
If it's any help, the matrix files are distance outputs from sklearn.metrics.pairwise.pairwise_distances.
NB: in inp_df the values of column B are unique, while the values of column A may or may not be unique.
Also, matrix_df's first column header was empty, and I renamed it with the following code for ease of understanding, since it was a header-less matrix output file.
dist_df = pd.read_csv(mypath,index_col=False)
dist_df.rename(columns={'Unnamed: 0':'B'}, inplace=True)
Step 1: Concatenate your inputs with pd.concat and merge with inp_df using df.merge
In [641]: out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)
Step 2: Create the distance column with df.apply by using A's values to index into the correct column
In [642]: out_df.assign(distance=out_df.apply(lambda x: x[str(int(x['A']))], axis=1))\
[['A', 'B', 'cell', 'distance', 'avg_distance']]
Out[642]:
A B cell distance avg_distance
0 100 200 1 7.5 52.00
1 115 270 1 53.0 50.00
2 115 266 1 84.0 57.00
3 145 255 2 74.9 53.47
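Since the question mentions millions of rows, the row-wise apply in step 2 can also be replaced by a vectorized lookup; a rough sketch, under the same assumption that the per-cell distance columns are labelled with the string form of A's values:
import numpy as np
cols = out_df['A'].astype(int).astype(str)                  # which column to pick for each row
col_idx = out_df.columns.get_indexer(cols)                  # positional index of that column, row by row
out_df['distance'] = out_df.to_numpy()[np.arange(len(out_df)), col_idx]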
