My dataframe looks something like this:
93 40 73 41 115 74 59 98 76 109 43 44
105 119 56 62 69 51 50 104 91 78 77 75
119 61 106 105 102 75 43 51 60 114 91 83
It has 8000 rows and 12 columns.
I want to find the least frequent value in the whole dataframe (not just within a column).
I tried converting the dataframe into a numpy array and using a for loop to count the numbers and return the one with the lowest count, but that is not very efficient. I searched for other methods but could not find any.
I only found scipy.stats.mode, which returns the most frequent number.
Is there any other way to do it?
You could stack and take the value_counts:
df.stack().value_counts().index[-1]
# 69
value_counts sorts by frequency in descending order, so the last index entry is a least frequent value. In this example many values appear just once, and 69 happens to be the last of them.
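Because ties are common here, it may help to list every value that shares the minimum count instead of relying on which one lands last. A minimal sketch on a small made-up frame (the data and the names counts and least are illustrative, not from the question):

```python
import pandas as pd

# Small made-up frame; 1, 4 and 5 each appear exactly once.
df = pd.DataFrame([[1, 2, 2], [3, 3, 3], [4, 5, 2]])

# Flatten all columns into one Series and count each value.
counts = df.stack().value_counts()

# Keep every value whose count equals the minimum, so ties are explicit.
least = sorted(counts[counts == counts.min()].index)
print(least)
```

Taking only index[-1] would return just one of these tied values.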
Another way using pandas.DataFrame.apply with pandas.Series.value_counts:
df.apply(pd.Series.value_counts).sum(1).idxmin()
# 40
# There are many values with same frequencies.
To my surprise, the apply method seems to be the fastest of the methods I've tried (which is why I'm posting):
df2 = pd.DataFrame(np.random.randint(1, 1000, (500000, 100)))
%timeit df2.apply(pd.Series.value_counts).sum(1).idxmin()
# 2.36 s ± 193 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2.stack().value_counts().index[-1]
# 3.02 s ± 86.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
uniq, cnt = np.unique(df2, return_counts=True)
uniq[np.argmin(cnt)]
# 2.77 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Contrary to my understanding that apply is very slow, it even outperformed numpy.unique (though perhaps my code is wrong).
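If the values are small non-negative integers, as in the sample rows, np.bincount may be another option worth timing, since it counts without sorting. This is a swapped-in technique, not one of the methods timed above, and the frame below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Made-up frame; assumes every entry is a non-negative integer of modest size.
df = pd.DataFrame([[1, 2, 2], [3, 3, 3], [4, 5, 2]])

flat = df.to_numpy().ravel()
counts = np.bincount(flat)  # counts[v] = number of occurrences of v

# Ignore values that never occur, then pick the one with the smallest count.
present = np.flatnonzero(counts)
least = present[np.argmin(counts[present])]
print(least)
```

With ties, np.argmin returns the first minimum, so this picks the smallest of the tied values.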
I have a dataframe that has similar ids with spatiotemporal data like below:
car_id lat long
xxx 32 150
xxx 33 160
yyy 20 140
yyy 22 140
zzz 33 70
zzz 33 80
. . .
I want to replace car_id with car_1, car_2, car_3, ... However, my dataframe is large, so it's not possible to do this manually by name. First I made a list of all unique values in the car_id column and a list of the names they should be replaced with:
u_values = [i for i in df['car_id'].unique()]
r = ['car'+str(i) for i in range(len(u_values))]
Now I'm not sure how to replace all unique numbers in car_id column with list values so the result is like this:
car_id lat long
car_1 32 150
car_1 33 160
car_2 20 140
car_2 22 140
car_3 33 70
car_3 33 80
. . .
The answers so far seem a little complicated to me, so here's another suggestion. This creates a dictionary that has the old name as the keys and the new name as the values. That can be used to map the old values to new values.
r={k:'car_{}'.format(i) for i,k in enumerate(df['car_id'].unique())}
df['car_id'] = df['car_id'].map(r)
edit: the answer using factorize is probably better even though I think this is a bit easier to read
Create a mapping from u_values to r and map it to car_id column. Also simplify the definition of u_values and r by using tolist() method and f-strings, respectively.
u_values = df['car_id'].unique().tolist()
r = [f'car_{i}' for i in range(len(u_values))]
mapping = pd.Series(r, index=u_values)
df['car_id'] = df['car_id'].map(mapping)
That said, it seems vectorized string concatenation is enough for this task. factorize() method encodes the strings.
df['car_id'] = 'car_' + pd.Series(df['car_id'].factorize()[0], dtype='string')
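Two details may be worth checking with the factorize() approach: the codes start at 0 (so the first id becomes car_0, while the question asks for car_1), and a pd.Series built from the codes gets a default 0..n-1 index, which only aligns with df if df has that index too. A minimal sketch, assuming the sample data from the question and a deliberately non-default index:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "car_id": ["xxx", "xxx", "yyy", "yyy", "zzz", "zzz"],
        "lat": [32, 33, 20, 22, 33, 33],
        "long": [150, 160, 140, 140, 70, 80],
    },
    index=[10, 11, 12, 13, 14, 15],  # deliberately not 0..n-1
)

# codes are 0-based integers in order of first appearance.
codes, uniques = pd.factorize(df["car_id"])

# Shift to 1-based and reuse df's index so the assignment aligns row by row.
df["car_id"] = "car_" + pd.Series(codes + 1, index=df.index).astype(str)
print(df)
```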
When I timed some of these methods (I omitted Juan Manuel Rivera's solution because replace is very slow and the code takes forever on larger data), the map() implementation built on OP's code turned out to be the fastest.
The factorize() implementation, while concise, is not fast after all. I also agree with pasnik that their solution is the easiest to read.
# a dataframe with 500k rows and 100k unique car_ids
df = pd.DataFrame({'car_id': np.random.default_rng().choice(100000, size=500000)})
%timeit u_values = df['car_id'].unique().tolist(); r = [f'car_{i}' for i in range(len(u_values))]; mapping = pd.Series(r, index=u_values); df.assign(car_id=df['car_id'].map(mapping))
# 136 ms ± 2.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.assign(car_id = 'car_' + pd.Series(df['car_id'].factorize()[0], dtype='string'))
# 602 ms ± 19.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit r={k:'car_{}'.format(i) for i,k in enumerate(df['car_id'].unique())}; df.assign(car_id=df['car_id'].map(r))
# 196 ms ± 3.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
It may be easier to use a dictionary that maps each unique value (xxx, yyy, ...) to the new id you want (car_1, car_2, car_3, ...):
newIdDict = {}
idCounter = 1
for i in df['car_id'].unique():
    if i not in newIdDict:
        newIdDict[i] = 'car_' + str(idCounter)
        idCounter += 1
Then you can use the pandas replace function to change the values in the car_id column:
df['car_id'] = df['car_id'].replace(newIdDict)
Note that calling replace on the whole dataframe instead (df.replace(newIdDict)) would change every xxx, yyy, ... anywhere in it, so any such value in the lat or long columns would also be modified.
I need to calculate distances between two data points ((lat1,lon1) and (lat2,lon2)).
I found a way to do it here:
import geopy.distance
coords_1 = (52.2296756, 21.0122287)
coords_2 = (52.406374, 16.9251681)
print(geopy.distance.geodesic(coords_1, coords_2).km)  # vincenty was removed in geopy 2.0; geodesic replaces it
As a result, I need to combine latitude and longitude into one column.
I found a way here; however, it takes too much time.
df["point1"] = df[["lon1", "lat1"]].apply(Point, axis=1)
df["point2"] = df[["lon2", "lat2"]].apply(Point, axis=1)
Is there a faster solution?
Try using geopandas.points_from_xy():
import geopandas
df['points1'] = geopandas.points_from_xy(df.lon1, df.lat1)
df['points2'] = geopandas.points_from_xy(df.lon2, df.lat2)
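If it is still too slow, install pygeos, which vectorizes points_from_xy() and speeds it up further. (On newer installs, Shapely 2.0 provides the same vectorized operations, since pygeos was merged into it.)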
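If the end goal is the distance rather than the Point objects, one option is to skip the point column and compute distances directly on the coordinate columns. The sketch below uses the haversine formula vectorized with NumPy. This is a swapped-in technique: it treats the Earth as a sphere, so results differ slightly from the ellipsoidal distance geopy computes. The function name haversine_km and the second sample row are illustrative.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between arrays of points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


df = pd.DataFrame(
    {
        "lat1": [52.2296756, 0.0],
        "lon1": [21.0122287, 0.0],
        "lat2": [52.406374, 0.0],
        "lon2": [16.9251681, 0.0],
    }
)

# One vectorized call for the whole column; no per-row Python objects.
df["dist_km"] = haversine_km(df["lat1"], df["lon1"], df["lat2"], df["lon2"])
print(df)
```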
If you want tuples of the form (x,y) you can do this:
Imagine your dataframe looks like this:
df = pd.read_csv(r"C:\users\k_sego\LatLong.csv", sep=";")
print(df)
Lat Lon
0 59.214735 18.062262
1 59.214735 18.062262
2 59.214735 18.062262
3 59.213542 18.063627
4 59.212553 18.064678
.. ... ...
70 59.199559 18.046147
71 59.199559 18.046147
72 59.199559 18.046147
73 59.198898 18.051291
74 59.199044 18.055571
Then
df['new_col'] = list(zip(df.Lat, df.Lon))
produces this:
Lat Lon new_col
0 59.214735 18.062262 (59.214735, 18.062262)
1 59.214735 18.062262 (59.214735, 18.062262)
2 59.214735 18.062262 (59.214735, 18.062262)
3 59.213542 18.063627 (59.213542, 18.063627)
4 59.212553 18.064678 (59.212553, 18.064678)
.. ... ... ...
70 59.199559 18.046147 (59.199559, 18.046147)
71 59.199559 18.046147 (59.199559, 18.046147)
72 59.199559 18.046147 (59.199559, 18.046147)
73 59.198898 18.051291 (59.198898, 18.051291)
74 59.199044 18.055571 (59.199044, 18.055571)
If you want 'point' as a tuple -
df['point1'] = list(zip(df['lat1'].values, df['lon1'].values))
If you want 'point' as a list -
df['point1'] = list(map(list,zip(df['lat1'].values, df['lon1'].values)))
Performance Comparison ->
%timeit geopandas.points_from_xy(df.D, df.B)
108 µs ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit list(map(list,zip(df['D'].values, df['B'].values)))
4.82 µs ± 12.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As you can see, using zip/list/map is a lot faster on this data.
I have a dataframe like this
df = pd.DataFrame({'id': [205,205,205, 211, 211, 211]
, 'date': pd.to_datetime(['2019-12-01','2020-01-01', '2020-02-01'
,'2019-12-01' ,'2020-01-01', '2020-03-01'])})
df
id date
0 205 2019-12-01
1 205 2020-01-01
2 205 2020-02-01
3 211 2019-12-01
4 211 2020-01-01
5 211 2020-03-01
where the column date is made by consecutive months for id 205 but not for id 211.
I want to keep only the observations (id) for which I have monthly data without jumps. In this example I want:
id date
0 205 2019-12-01
1 205 2020-01-01
2 205 2020-02-01
Here I am collecting the id to keep:
keep_id = []
for num in pd.unique(df['id']):
    dates = df.loc[df['id'] == num, 'date']
    prev = dates.shift(1)
    temp = (dates.dt.year - prev.dt.year) * 12 + dates.dt.month - prev.dt.month
    temp.iloc[0] = 1.0  # here I correct the first entry
    if (temp == 1.).all():
        keep_id.append(num)
where I am using (df.loc[num,'date'].dt.year - df.loc[num,'date'].shift(1).dt.year) * 12 + df.loc[num,'date'].dt.month - df.loc[num,'date'].shift(1).dt.month to compute the difference in months from the previous date for every id.
This seems to work when tested on a small portion of df, but I'm sure there is a better way of doing this, maybe using the .groupby() method.
Since df is made of millions of observations, my code takes too much time (and I'd like to learn a more efficient and pythonic way of doing this).
What you want to do is use groupby-filter rather than a groupby apply.
df.groupby('id').filter(lambda x: not (x.date.diff() > pd.Timedelta(days=32)).any())
provides exactly:
id date
0 205 2019-12-01
1 205 2020-01-01
2 205 2020-02-01
I would indeed keep the index unique; a unique index has too many useful properties to give up.
Both this response and Michael's above are correct in terms of output. In terms of performance, they are very similar as well:
%timeit df.groupby('id').filter(lambda x: not (x.date.diff() > pd.Timedelta(days=32)).any())
1.48 ms ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
and
%timeit df[df.groupby('id')['date'].transform(lambda x: x.diff().max() < pd.Timedelta(days=32))]
1.7 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For most operations, this difference is negligible.
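The 32-day threshold is a proxy for "exactly one month later". A possible alternative is to count calendar months directly and avoid the Python lambda altogether. A minimal sketch on the question's sample data, assuming rows are already sorted by date within each id (the names months, gaps, bad and result are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [205, 205, 205, 211, 211, 211],
        "date": pd.to_datetime(
            ["2019-12-01", "2020-01-01", "2020-02-01",
             "2019-12-01", "2020-01-01", "2020-03-01"]
        ),
    }
)

# Absolute month number, so consecutive months differ by exactly 1.
months = df["date"].dt.year * 12 + df["date"].dt.month

# Month gap to the previous row within the same id (NaN for the first row).
gaps = months.groupby(df["id"]).diff()

# A row is "bad" if it has a previous row and the gap is not one month.
bad = gaps.notna() & (gaps != 1)

# Drop every id that contains at least one bad row.
result = df[~bad.groupby(df["id"]).transform("any")]
print(result)
```

Unlike the day-based threshold, this would also reject two rows that fall in the same month.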
You can use the following approach, though it was only about 3x faster in my tests.
df[df.groupby('id')['date'].transform(lambda x: x.diff().max() < pd.Timedelta(days=32))]
Out:
date
id
205 2019-12-01
205 2020-01-01
205 2020-02-01
Here's the thing, I have the dataset below where date is the index:
date value
2020-01-01 100
2020-02-01 140
2020-03-01 156
2020-04-01 161
2020-05-01 170
.
.
.
And I want to transform it in this other dataset:
value_t0 value_t1 value_t2 value_t3 value_t4 ...
100 NaN NaN NaN NaN ...
140 100 NaN NaN NaN ...
156 140 100 NaN NaN ...
161 156 140 100 NaN ...
170 161 156 140 100 ...
First I thought about using pandas.pivot_table to do something, but that would just provide a different layout grouped by some column, which is not exactly what I want. Later, I thought about using pandasql and apply 'case when', but that wouldn't work because I would have to type dozens of lines of code. So I'm stuck here.
Try this:
new_df = pd.DataFrame({f"value_t{i}": df['value'].shift(i) for i in range(len(df))})
The Series .shift(n) method gives you a single column of your desired output by shifting everything down n rows and filling in NaNs above. So we build a new dataframe by feeding it a dictionary of the form {column name: column data, ...}, using a dictionary comprehension over the shift amounts 0 to len(df) - 1.
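Building one column per row makes the result grow quadratically with len(df), so in practice it may be enough to cap the number of lags. A minimal sketch on the question's first five rows, where n_lags is a hypothetical cap that is not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [100, 140, 156, 161, 170]},
    index=pd.date_range("2020-01-01", periods=5, freq="MS"),
)

n_lags = 3  # hypothetical cap; the answer above uses len(df)

# shift(i) moves the column down i rows, leaving NaN in the first i rows.
new_df = pd.DataFrame(
    {f"value_t{i}": df["value"].shift(i) for i in range(n_lags)}
)
print(new_df)
```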
I think the best option is to use numpy:
values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0], 1)] = np.nan
new_df = pd.DataFrame(new_values).add_prefix('value_t')
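If I read the snippet correctly, np.repeat(...).T puts values[i] in every kept cell of row i, so the second row would come out as 140, 140 instead of the requested 140, 100. A possible NumPy variant that indexes values[i - j] for cell (i, j) is sketched below. It is an illustrative alternative, not the code that was timed, and it is checked against the question's first five rows only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"value": [100, 140, 156, 161, 170]},
    index=pd.date_range("2020-01-01", periods=5, freq="MS"),
)

values = df["value"].to_numpy(dtype=float)
n = len(values)

# src[i, j] = i - j is the position in `values` that cell (row i, lag j) needs.
src = np.arange(n)[:, None] - np.arange(n)[None, :]

# Negative positions mean "no data that far back": fill those cells with NaN.
lagged = np.where(src >= 0, values[np.clip(src, 0, None)], np.nan)

new_df = pd.DataFrame(lagged, index=df.index).add_prefix("value_t")
print(new_df)
```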
Times for 5000 rows
%%timeit
values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0],1)] = np.nan
new_df = pd.DataFrame(new_values).add_prefix('value_t')
556 ms ± 35.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
new_df = pd.DataFrame({f"value_t{i}": df['value'].shift(i) for i in range(len(df))})
1.31 s ± 36.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Time without add_prefix
%%timeit
values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0],1)] = np.nan
new_df = pd.DataFrame(new_values)
357 ms ± 8.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I'm using 12 hours of sensor data sampled at 25 Hz, which I query from MongoDB into a dataframe.
I'm trying to extract a list or a dict of 1-minute dataframes from the 12 hours.
I use a window of 1 minute and a stride (step) of 10 seconds.
The goal is to build a dataset by creating a list or dict of 1-minute dataframes (samples) from 12 hours of data, which will be converted to tensors and fed to a deep learning model.
The dataframe has a datetime index and 4 columns of sensor values.
Here is what part of the data looks like:
A B C D
2020-06-17 22:00:00.000 1.052 -0.147 0.836 0.623
2020-06-17 22:00:00.040 1.011 -0.147 0.820 0.574
2020-06-17 22:00:00.080 1.067 -0.131 0.868 0.607
2020-06-17 22:00:00.120 1.033 -0.163 0.820 0.607
2020-06-17 22:00:00.160 1.030 -0.147 0.820 0.607
Below is sample code similar to how I extract the 1-minute windows. For 12 hours of data it takes 5 minutes, which is a long time.
Any ideas on how to reduce the running time in this case?
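step = 10 * 25  # 10-second stride at 25 Hz
w = 60 * 25  # 1-minute window at 25 Hz
# df holds the 12 hours of data
sensor_dfs = []
# stop at the last start position that still yields a full window of w rows
df_range = range(0, df.shape[0] - w + 1, step)
for a in df_range:
    sample = df.iloc[a:a + w]
    sensor_dfs.append(sample)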
I created random data and made the following experiments looking at runtime:
# create random normal samples
w= 60*25 # 1 minute window
step=w # no overlap
num_samples=50000
data= np.random.normal(size=(num_samples,3))
date_rng=pd.date_range(start="2020-07-09 00:00:00.000",
freq="40ms",periods=num_samples)
data=pd.DataFrame(data, columns=["x","y","z"], index=date_rng)
data.head()
x y z
2020-07-09 00:00:00.000 -1.062264 -0.008656 0.399642
2020-07-09 00:00:00.040 0.182398 -1.014290 -1.108719
2020-07-09 00:00:00.080 -0.489814 -0.020697 0.651120
2020-07-09 00:00:00.120 -0.776405 -0.596601 0.611516
2020-07-09 00:00:00.160 0.663900 0.149909 -0.552779
numbers are of type float64
data.dtypes
x float64
y float64
z float64
dtype: object
using for loops
minute_samples = []
for i in range(0, len(data) - w, step):
    minute_samples.append(data.iloc[i:i + w])
result: 6.45 ms ± 256 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
using list comprehension
minute_samples=[data.iloc[i:i+w] for i in range(0,len(data)-w,step)]
result: 6.13 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
using Grouper with list comprehension
minute_samples = [g for _, g in data.groupby(pd.Grouper(freq="1min"))]  # "1T" is a deprecated alias in recent pandas
result: 7.89 ms ± 382 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
using Grouper with dict
minute_samples = dict(tuple(data.groupby(pd.Grouper(freq="1min"))))
result: 7.41 ms ± 38 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
resample is also an option here, but since it uses Grouper behind the scenes, I don't think its runtime will differ.
It seems that the list comprehension is slightly faster than the rest.
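The experiments above use step = w (no overlap), while the question asks for a 10-second stride. Since the windows end up as tensors anyway, one option is to skip the per-window dataframes and take strided views of the underlying array with NumPy's sliding_window_view (available in NumPy 1.20 and later). A minimal sketch on random data, assuming evenly spaced samples with no gaps; the names view, windows and starts are illustrative:

```python
import numpy as np
import pandas as pd

w = 60 * 25     # 1-minute window at 25 Hz
step = 10 * 25  # 10-second stride at 25 Hz
num_samples = 5000

rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.normal(size=(num_samples, 3)),
    columns=["x", "y", "z"],
    index=pd.date_range("2020-07-09", freq="40ms", periods=num_samples),
)

arr = data.to_numpy()

# Read-only view of shape (num_samples - w + 1, 3, w); no data is copied.
view = np.lib.stride_tricks.sliding_window_view(arr, w, axis=0)

# Keep every step-th window and reorder to (n_windows, w, n_channels).
windows = view[::step].transpose(0, 2, 1)

# Start timestamp of each window, in case labels are needed later.
starts = data.index[: num_samples - w + 1 : step]
print(windows.shape, len(starts))
```

The result is a view, so it may need np.ascontiguousarray(windows) (which copies) before being handed to a deep learning framework.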