I have a very long dataframe dfWeather with the column TEMP. Because of its size, I want to keep only the relevant information. Concretely, I want to keep only entries where the temperature changed by more than 1 since the last entry I kept. I want to use dfWeather.apply, since it seems to iterate over the rows much faster (about 10x) than a for-loop over dfWeather.iloc. I tried the following:
dfTempReduced = pd.DataFrame(columns = dfWeather.columns)
dfTempReduced.append(dfWeather.iloc[0])
dfWeather.apply(lambda x: dfTempReduced = dfTempReduced.append(x) if np.abs(dfTempReduced[-1].TEMP - x.TEMP) >= 1 else None, axis = 1)
unfortunately I get the error
SyntaxError: expression cannot contain assignment, perhaps you meant "=="?
Is there a fast way to get that desired result? Thanks!
EDIT:
Here is some example data
dfWeather[200:220].TEMP
Out[208]:
200 12.28
201 12.31
202 12.28
203 12.28
204 12.24
205 12.21
206 12.17
207 11.93
208 11.83
209 11.76
210 11.66
211 11.55
212 11.48
213 11.43
214 11.37
215 11.33
216 11.36
217 11.33
218 11.29
219 11.27
The desired result would yield only the first and the last entry, since the absolute difference is larger than 1. The first entry is always included.
If you don't need this to be recursive (i.e. comparing each value to the last *kept* entry: with [1, 2, 3] the recursive rule keeps [1, 3], because 2 is only 1 degree above 1, while 3 is more than 1 degree above 1 even though it is not more than 1 degree above 2), then you can simply use diff.
However, this doesn't work if the values drift slowly, with consecutive changes staying below the 1°C threshold for longer stretches. To overcome that limitation, you could round the values first (to whatever precision, though a 1°C threshold suggests that zero decimal places would be a good idea).
Let us create an example:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['TEMP'] = np.random.rand(100) * 2
so now if you are OK with using diff it can be done very efficiently just by:
# either slice
lg = df['TEMP'].apply(round).diff().abs() > 1
lg.iloc[0] = True  # diff() leaves NaN in the first row; always keep the first entry
df = df[lg]
# or drop the complement
lg = df['TEMP'].apply(round).diff().abs() <= 1
df.drop(index=lg[lg].index, inplace=True)
You even have two options for doing the reduction. I suspect that drop takes slightly longer but is more memory efficient than the slicing way.
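Note that diff only compares neighbouring rows, so it cannot reproduce the "compare against the last entry I kept" semantics exactly. For that sequential rule, a plain loop over the underlying NumPy array is a reasonable sketch (the helper name keep_changes is hypothetical); it avoids the repeated DataFrame appends, which are the expensive part:

```python
import numpy as np
import pandas as pd

def keep_changes(temp, threshold=1.0):
    """Return the index labels of the first value and of every value that
    differs by more than `threshold` from the last *kept* value."""
    vals = temp.to_numpy()
    keep = np.zeros(len(vals), dtype=bool)
    if len(vals):
        keep[0] = True                      # the first entry is always included
        last = vals[0]
        for i in range(1, len(vals)):
            if abs(vals[i] - last) > threshold:
                keep[i] = True
                last = vals[i]              # only update on a kept entry
    return temp.index[keep]

df = pd.DataFrame({'TEMP': [12.28, 12.31, 11.9, 11.2, 11.25, 10.1]})
dfReduced = df.loc[keep_changes(df['TEMP'])]
```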
I have a dataframe, height_df, with three measurements, 'height_1','height_2','height_3'. I want to create a new column that has the mean of all three heights. A printout of height_df is given below
height_1 height_2 height_3
0 1.78 1.80 1.80
1 1.70 1.70 1.69
2 1.74 1.75 1.73
3 1.66 1.68 1.67
The following code works but I don't understand why
height_df['height'] = height_df[['height_1','height_2','height_3']].mean(axis=1)
I actually want the mean across the row axis, i.e. for each row compute the average of the three heights. I would have thought the axis argument in mean should then be set to 0, since that is what corresponds to applying the mean across rows; however, axis=1 is what gets the result I am looking for. Why is this? If axis=1 is for columns and axis=0 is for rows, then why does .mean(axis=1) take the mean across rows?
You just need to tell mean to work across the columns with axis=1. Think of the axis argument as naming the axis that gets collapsed, not the one you keep: axis=0 collapses the rows (one result per column), axis=1 collapses the columns (one result per row).
df = pd.DataFrame({"height_1":[1.78,1.7,1.74,1.66],"height_2":[1.8,1.7,1.75,1.68],"height_3":[1.8,1.69,1.73,1.67]})
# mean over every column
df = df.assign(height_mean=df.mean(axis=1))
# equivalent, selecting the three height columns explicitly
df = df.assign(height_mean=df.loc[:,['height_1','height_2','height_3']].mean(axis=1))
print(df.to_string(index=False))
output
height_1 height_2 height_3 height_mean
1.78 1.80 1.80 1.793333
1.70 1.70 1.69 1.696667
1.74 1.75 1.73 1.740000
1.66 1.68 1.67 1.670000
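A tiny demonstration of the collapse rule on a hypothetical two-column frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

col_means = df.mean(axis=0)  # collapse the rows: one mean per column
row_means = df.mean(axis=1)  # collapse the columns: one mean per row
```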
Need help converting a txt file to csv with the rows and columns intact. The text file is here:
(http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=2020&MONTH=06&FROM=2300&TO=2300&STNM=72265)
So far I only have this...
df = pd.read_csv('sounding-72265-2020010100.txt',delimiter=',')
df.to_csv('sounding-72265-2020010100.csv')
But it produces only one column, with all the other columns crammed into its rows.
Instead, I want to format it into something like this:
CSV Format
Thanks for any help
I'm assuming you can start with text copied from the website; i.e. you create a data.txt file looking like the following by copy/pasting:
1000.0 8
925.0 718
909.0 872 39.6 4.6 12 5.88 80 7 321.4 340.8 322.5
900.0 964 37.6 11.6 21 9.62 75 8 320.2 351.3 322.1
883.0 1139 36.6 7.6 17 7.47 65 9 321.0 345.3 322.4
...
...
...
Then the following works, mainly based on this answer:
import pandas as pd
df = pd.read_table('data.txt', header=None, sep='\n')  # one row per line
df = df[0].str.strip().str.split(r'\s+', expand=True)  # split on runs of whitespace
You read the data only separating by new lines, generating a one column df. Then use string methods to format the entries and expand them into a new DataFrame.
You can then add the column names in as such with help from this answer:
col1 = 'PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV'.split()
col2 = 'hPa m C C % g/kg deg knot K K K '.split()
df.columns = pd.MultiIndex.from_tuples(zip(col1,col2), names = ['Variable','Unit'])
The result (df.head()):
Variable PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
Unit hPa m C C % g/kg deg knot K K K
0 1000.0 8 None None None None None None None None None
1 925.0 718 None None None None None None None None None
2 909.0 872 39.6 4.6 12 5.88 80 7 321.4 340.8 322.5
3 900.0 964 37.6 11.6 21 9.62 75 8 320.2 351.3 322.1
4 883.0 1139 36.6 7.6 17 7.47 65 9 321.0 345.3 322.4
I would actually probably drop the "Unit" level of the column names were it me, because I think MultiIndex columns can make slicing more complicated.
Again, both reading the data and column names assume you can just copy paste those into a text file/into Python and then parse. If you are reading many pages like this, or were looking to do some sort of web scraping, that will require additional work.
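One caveat with the str.split approach above: every resulting cell is a string (or None), so you will likely want to convert to numbers before computing anything. A minimal sketch on hypothetical data:

```python
import pandas as pd

# after .str.split(..., expand=True) every cell is a string or None
df = pd.DataFrame({0: ['1000.0', '925.0'], 1: ['8', '718'], 2: [None, None]})
df = df.apply(pd.to_numeric, errors='coerce')  # unparseable/None -> NaN
```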
Total time: 1.01876 s
Function: prepare at line 91
Line # Hits Time Per Hit % Time Line Contents
==============================================================
    91                                           @profile
    92                                           def prepare():
    98         1        536.0     536.0      0.1      tss = df.groupby('user_id').timestamp
    99         1     949643.0  949643.0     93.2      delta = tss.diff()
   ...        (remaining lines omitted; each accounts for less than 2% of the total time)
I have a dataframe which I group by some key and then select a column from each group and perform diff on that column (per group). As shown in the profiling results, the diff operation is super slow compared to the rest and is kind of a bottleneck. Is this expected? Are there faster alternatives to achieve the same result?
Edit: some more explanation
In my use case, the timestamps represent the times of a user's actions; I want to compute the deltas between consecutive actions (they are sorted), and each user's actions are completely independent of other users'.
Edit: Sample code
import pandas as pd
import numpy as np
df = pd.DataFrame(
{'ts':[1,2,3,4,60,61,62,63,64,150,155,156,
1,2,3,4,60,61,62,63,64,150,155,163,
1,2,3,4,60,61,62,63,64,150,155,183],
'id': [1,2,3,4,60,61,62,63,64,150,155,156,
71,72,73,74,80,81,82,83,64,160,165,166,
21,22,23,24,90,91,92,93,94,180,185,186],
'other':['x','x','x','','x','x','','x','x','','x','',
'y','y','y','','y','y','','y','y','','y','',
'z','z','z','','z','z','','z','z','','z',''],
'user':['x','x','x','x','x','x','x','x','z','x','x','y',
'y','y','y','y','y','y','y','y','x','y','y','x',
'z','z','z','z','z','z','z','z','y','z','z','z']
})
df.set_index('id',inplace=True)
deltas=df.groupby('user').ts.transform(pd.Series.diff)
If you do not wish to sort your data or drop down to numpy, then a significant performance improvement may be possible by converting your user series to Categorical. Categorical data is effectively stored as integer codes rather than objects.
In the example below, I see the total improve from 86 ms to about 59 ms (23 ms for the one-off conversion plus 36 ms for the groupby itself). This may improve further for larger datasets and where more users are repeated.
df = pd.concat([df]*10000)
%timeit df.groupby('user').ts.transform(pd.Series.diff) # 86.1 ms per loop
%timeit df['user'].astype('category') # 23.4 ms per loop
df['user'] = df['user'].astype('category')
%timeit df.groupby('user').ts.transform(pd.Series.diff) # 35.7 ms per loop
If you are performing multiple operations, then the one-off cost of converting to categorical can be discounted.
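For completeness, here is a sketch of the "drop down to numpy" route mentioned above: sort once by user (stably, so within-user order is preserved), take a single vectorised diff, and blank out the positions where the user changes. The helper name fast_group_diff is hypothetical, and this assumes each group's rows should be diffed in their original order:

```python
import numpy as np
import pandas as pd

def fast_group_diff(df, key='user', col='ts'):
    """Per-group diff without groupby: one stable sort + one vectorised diff."""
    order = np.argsort(df[key].to_numpy(), kind='stable')
    keys = df[key].to_numpy()[order]
    vals = df[col].to_numpy()[order].astype(float)
    out = np.empty(len(vals))
    out[0] = np.nan
    out[1:] = vals[1:] - vals[:-1]
    out[1:][keys[1:] != keys[:-1]] = np.nan   # first row of each group -> NaN
    result = np.empty(len(vals))
    result[order] = out                       # restore the original row order
    return pd.Series(result, index=df.index)
```

For non-empty frames the result should match df.groupby('user').ts.transform(pd.Series.diff).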
I am trying to translate the input dataframe (inp_df) into the output dataframe (out_df) using data from the cell-based intermediate dataframe (matrix_df), as shown below.
There are several cell-number-based files with distance values, shown as matrix_df.
The program iterates by cell and fetches data from the appropriate file, so on each iteration matrix_df holds the data for all rows of the current cell # in inp_df.
inp_df
A B cell
100 200 1
115 270 1
145 255 2
115 266 1
matrix_df (cell_1.csv)
B 100 115 199 avg_distance
200 7.5 80.7 67.8 52
270 6.8 53 92 50
266 58 84 31 57
matrix_df (cell_2.csv)
B 145 121 166 avg_distance
255 74.9 77.53 8 53.47
out_df dataframe
A B cell distance avg_distance
100 200 1 7.5 52
115 270 1 53 50
145 255 2 74.9 53.47
115 266 1 84 57
My current thought process for each cell#'s data is:
use an apply function to go row by row
then join inp_df with matrix_df on column B, where matrix_df is somehow translated into a tuple of column name, distance & average distance.
But I am looking for an idiomatic pandas way of doing this, since my approach will slow down when there are millions of rows in the input. I am specifically looking for the core logic inside an iteration to fetch the matches, since the number of columns in matrix_df varies per cell.
If it's any help, the matrix files are the distance outputs from sklearn.metrics.pairwise.pairwise_distances.
NB: In inp_df the values of column B are unique, while the values of column A may or may not be unique.
Also, matrix_df's first column was empty, and I renamed it with the following code for ease of understanding, since it is a header-less matrix output file.
dist_df = pd.read_csv(mypath,index_col=False)
dist_df.rename(columns={'Unnamed: 0':'B'}, inplace=True)
Step 1: Concatenate your inputs with pd.concat and merge with inp_df using df.merge
In [641]: out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)
Step 2: Create the distance column with df.apply by using A's values to index into the correct column
In [642]: out_df.assign(distance=out_df.apply(lambda x: x[str(int(x['A']))], axis=1))\
[['A', 'B', 'cell', 'distance', 'avg_distance']]
Out[642]:
A B cell distance avg_distance
0 100 200 1 7.5 52.00
1 115 270 1 53.0 50.00
2 115 266 1 84.0 57.00
3 145 255 2 74.9 53.47
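If you'd rather avoid the row-wise apply entirely, an alternative sketch melts each matrix into long form (one row per (B, A, distance) triple) and merges on both keys; the frame names matrix_df1/matrix_df2 below mirror the hypothetical inputs used above:

```python
import pandas as pd

inp_df = pd.DataFrame({'A': [100, 115, 145, 115],
                       'B': [200, 270, 255, 266],
                       'cell': [1, 1, 2, 1]})
matrix_df1 = pd.DataFrame({'B': [200, 270, 266],
                           '100': [7.5, 6.8, 58.0],
                           '115': [80.7, 53.0, 84.0],
                           '199': [67.8, 92.0, 31.0],
                           'avg_distance': [52.0, 50.0, 57.0]})
matrix_df2 = pd.DataFrame({'B': [255], '145': [74.9], '121': [77.53],
                           '166': [8.0], 'avg_distance': [53.47]})

# melt each matrix into long form: one row per (B, A, distance)
long_df = pd.concat(
    m.melt(id_vars=['B', 'avg_distance'], var_name='A', value_name='distance')
    for m in (matrix_df1, matrix_df2)
)
long_df['A'] = long_df['A'].astype(int)
out_df = inp_df.merge(long_df, on=['A', 'B'], how='left')
```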