Say I have a dataframe containing strings, such as:
df = pd.DataFrame({'col1':list('some_string')})
col1
0 s
1 o
2 m
3 e
4 _
5 s
...
I'm looking for a way to apply a rolling window on col1 and join the strings in a certain window size. Say for instance window=3, I'd like to obtain (with no minimum number of observations):
col1
0 s
1 so
2 som
3 ome
4 me_
5 e_s
6 _st
7 str
8 tri
9 rin
10 ing
I've tried the obvious solutions with rolling which fail at handling object types:
df.col1.rolling(3, min_periods=0).sum()
df.col1.rolling(3, min_periods=0).apply(''.join)
Both raise:
cannot handle this type -> object
Is there a generalisable approach to do so (not using shift to match this specific case of w=3)?
How about shifting the series?
df.col1.shift(2).fillna('') + df.col1.shift().fillna('') + df.col1
Generalizing to any window size (the shifts must run from largest to smallest so the characters stay in order):
pd.concat([df.col1.shift(i).fillna('') for i in reversed(range(3))], axis=1).sum(axis=1)
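The shift-based idea can be wrapped in a small helper for an arbitrary window size. This is only a minimal sketch: `rolling_join` is a hypothetical name (not a pandas function), and it assumes the column holds nothing but strings.

```python
import pandas as pd

def rolling_join(series, window):
    """Concatenate each string with the (window - 1) strings before it."""
    # Largest shift first, so older characters end up on the left.
    parts = [series.shift(i).fillna('') for i in range(window - 1, -1, -1)]
    result = parts[0]
    for part in parts[1:]:
        result = result + part  # element-wise string concatenation
    return result

df = pd.DataFrame({'col1': list('some_string')})
joined = rolling_join(df['col1'], 3)
print(joined)
```

Because missing values from the shift are filled with empty strings, the first rows simply contain shorter strings, which matches the "no minimum number of observations" requirement.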
Rolling works only with numbers:
def _prep_values(self, values=None, kill_inf=True):
    if values is None:
        values = getattr(self._selected_obj, 'values', self._selected_obj)
    # GH #12373 : rolling functions error on float32 data
    # make sure the data is coerced to float64
    if is_float_dtype(values.dtype):
        values = ensure_float64(values)
    elif is_integer_dtype(values.dtype):
        values = ensure_float64(values)
    elif needs_i8_conversion(values.dtype):
        raise NotImplementedError...
    ...
    ...
So you have to construct the windows manually. Here is one possible variant using a simple list comprehension (there may be a more pandas-idiomatic way):
df = pd.DataFrame({'col1':list('some_string')})
pd.Series([
    ''.join(df.col1.values[max(i-2, 0): i+1])
    for i in range(len(df.col1.values))
])
0 s
1 so
2 som
3 ome
4 me_
5 e_s
6 _st
7 str
8 tri
9 rin
10 ing
dtype: object
Using pd.Series.cumsum seems to work (although it is somewhat inefficient, since every row first holds the full concatenation up to that point):
df['col1'].cumsum().str[-3:]
Output:
0 s
1 so
2 som
3 ome
4 me_
5 e_s
6 _st
7 str
8 tri
9 rin
10 ing
Name: col1, dtype: object
Can you help with the following task? I have a dataframe column such as:
index df['Q0']
0 1
1 2
2 3
3 5
4 5
5 6
6 7
7 8
8 3
9 2
10 4
11 7
I want to substitute the values in df.loc[3:8,'Q0'] with the values in df.loc[0:2,'Q0'] if df.loc[0,'Q0']!=df.loc[3,'Q0']
The result should look like the one below:
index df['Q0']
0 1
1 2
2 3
3 1
4 2
5 3
6 1
7 2
8 3
9 2
10 4
11 7
I tried the following line:
df.loc[3:8,'Q0'].where(~df.loc[0,'Q0']!=df.loc[3,'Q0']),other=df.loc[0:2,'Q0'],inplace=True)
or
df['Q0'].replace(to_replace=df.loc[3:8,'Q0'], value=df.loc[0:2,'Q0'], inplace=True)
But neither works. Most likely I am doing something wrong.
Any suggestions?
You can use the cycle function:
from itertools import cycle
c = cycle(df["Q0"][0:3])
if df.Q0[0] != df.Q0[3]:
    df["Q0"][3:8] = [next(c) for _ in range(5)]
Thanks for the replies. I tried the suggestions but I have some issues:
@adnanmuttaleb -
When I applied the function to a dataframe with more than one column (e.g. 12x2 or larger), I noticed that the value in df.Q0[8] didn't change. Why?
@jezrael -
When I adjust my code to your suggestion I get the error:
ValueError: cannot copy sequence with size 5 to array axis with dimension 6
When I change the range to 6, I get wrong results:
import pandas as pd
from itertools import cycle
data={'Q0':[1,2,3,5,5,6,7,8,3,2,4,7],
'Q0_New':[0,0,0,0,0,0,0,0,0,0,0,0]}
df = pd.DataFrame(data)
##### version 1
c = cycle(df["Q0"][0:3])
if df.Q0[0] != df.Q0[3]:
    df['Q0_New'][3:8] = [next(c) for _ in range(5)]
##### version 2
d = cycle(df.loc[0:3,'Q0'])
if df.Q0[0] != df.Q0[3]:
    df.loc[3:8,'Q0_New'] = [next(d) for _ in range(6)]
Why do the two versions behave differently, and what corrections need to be made?
Thanks once more guys.
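A likely explanation for the difference: plain `[3:8]` slicing is positional and excludes the end, so it covers rows 3 to 7 (5 values) and leaves row 8 untouched, whereas `.loc[3:8]` is label-based and includes both endpoints (6 values). For the same reason `df.loc[0:3, 'Q0']` contains four values (1, 2, 3, 5) rather than three, so the cycle in version 2 repeats the wrong pattern. A minimal sketch that uses `.loc` consistently, assuming the goal is the expected output from the question:

```python
import pandas as pd
from itertools import cycle

df = pd.DataFrame({"Q0": [1, 2, 3, 5, 5, 6, 7, 8, 3, 2, 4, 7]})

# .loc slices are label-based and include BOTH endpoints.
source = df.loc[0:2, "Q0"].tolist()   # rows 0, 1, 2 -> three values
target_len = len(df.loc[3:8])         # rows 3..8    -> six values

if df.loc[0, "Q0"] != df.loc[3, "Q0"]:
    values = cycle(source)
    # A single .loc assignment also avoids chained indexing such as df["Q0"][3:8] = ...
    df.loc[3:8, "Q0"] = [next(values) for _ in range(target_len)]

print(df)
```

Deriving the number of values from the target slice keeps the list length and the slice length in sync, which avoids the "cannot copy sequence" error.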
I need to round prices in a column to different number of decimals in Python. I am using this code to create the dataframe, df_prices:
df_prices = pd.DataFrame({'InstrumentID':['001','002','003','004','005','006'], 'Price':[12.44,6.5673,23.999,56.88,4333.22,27.8901],'RequiredDecimals':[2,0,1,2,0,3]})
The data looks like this:
InstrumentID Price RequiredDecimals
1 12.444 2
2 6.5673 0
3 23.999 1
4 56.88 2
5 4333.22 0
6 27.8901 3
I often get this error:
TypeError: cannot convert the series to
Neither of these statements worked:
df_prices['PriceRnd'] = np.round(df_prices['Price'] , df_prices['RequiredDecimals'])
df_prices['PriceRnd'] = df_prices['Price'].round(decimals = df_prices['RequiredDecimals'] )
This is what the final output should look like:
Instrument# Price RequiredDecimals PriceRnd
1 12.444 2 12.44
2 6.5673 0 7
3 23.999 1 24.0
4 56.88 2 56.88
5 4333.22 0 4333
6 27.8901 3 27.890
I couldn't find a better solution, but this one seems to work:
df_prices['Rnd'] = [np.around(x, y) for x, y in zip(df_prices['Price'], df_prices['RequiredDecimals'])]
Although not elegant, you can try this.
import pandas as pd
df_prices = pd.DataFrame({'InstrumentID':['001','002','003','004','005','006'], 'Price':[12.44,6.5673,23.999,56.88,4333.22,27.8901],'RequiredDecimals':[2,0,1,2,0,3]})
print(df_prices)
list1 = []
for i in df_prices.values:
    list1.append('{:.{}f}'.format(i[1], i[2]))
print(list1)
df_prices["Rounded Price"] =list1
print(df_prices)
InstrumentID Price RequiredDecimals Rounded Price
0 001 12.4400 2 12.44
1 002 6.5673 0 7
2 003 23.9990 1 24.0
3 004 56.8800 2 56.88
4 005 4333.2200 0 4333
5 006 27.8901 3 27.890
or as a one-liner:
df_prices['Rnd'] = ['{:.{}f}'.format(x, y) for x, y in zip(df_prices['Price'], df_prices['RequiredDecimals'])]
An alternative is to scale each number by an appropriate power of ten and then rely on the fact that .round(), called without arguments, rounds to the nearest integer.
df_prices['factor'] = 10**df_prices['RequiredDecimals']
df_prices['rounded'] = (df_prices['Price'] * df_prices['factor']).round() / df_prices['factor']
After rounding, the number is divided again by the factor.
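The scale, round, and rescale steps can be collected in one helper. This is a sketch under assumptions: `round_by_column` is a hypothetical name, and the float base `10.0` is my choice so the factor is always a float.

```python
import pandas as pd

def round_by_column(values, decimals):
    """Round each value to the number of decimals given in the same row."""
    factor = 10.0 ** decimals          # e.g. 2 decimals -> 100.0
    return (values * factor).round() / factor

df_prices = pd.DataFrame({
    'InstrumentID': ['001', '002', '003', '004', '005', '006'],
    'Price': [12.44, 6.5673, 23.999, 56.88, 4333.22, 27.8901],
    'RequiredDecimals': [2, 0, 1, 2, 0, 3],
})
df_prices['PriceRnd'] = round_by_column(df_prices['Price'], df_prices['RequiredDecimals'])
print(df_prices)
```

Unlike the string-formatting answers, the result stays numeric, so trailing zeros (such as 27.890) are not displayed.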
I want to create a pandas Series that contains the first 'n' natural numbers and their respective squares. The first 'n' numbers should appear in the index position, using manual indexing.
Can someone please share some code with me?
Use numpy.arange with ** for squares:
n = 5
s = pd.Series(np.arange(n) ** 2)
print (s)
0 0
1 1
2 4
3 9
4 16
dtype: int32
If you want to omit 0:
n = 5
arr = np.arange(1, n + 1)
s = pd.Series(arr ** 2, index=arr)
print (s)
1 1
2 4
3 9
4 16
5 25
dtype: int32
I have a variable in a pandas dataframe with values as below:
print (df.xx)
1 5679558
2 (714) 254
3 0
4 00000000
5 000000000
6 00000000000
7 000000001
8 000000002
9 000000003
10 000000004
11 000000005
print (df.dtypes)
xx object
I am doing the following to convert it to a number:
try:
    print(df.xx.apply(str).astype(int))
except ValueError:
    pass
I also tried this:
tin.tin = tin.tin.to_string().astype(int)
But this gives me a MemoryError, as I have 3M rows.
Can somebody help me strip the special characters and convert the column to int64?
You can test whether each string isdigit, then use that boolean mask to convert only those rows in a vectorised manner, using to_numeric with the parameter errors='coerce':
In [88]:
df.loc[df['xxx'].str.isdigit(), 'xxx'] = pd.to_numeric(df['xxx'], errors='coerce')
df
Out[88]:
xxx
0 5.67956e+06
1 (714) 254
2 0
3 0
4 0
5 0
6 1
7 2
8 3
9 4
10 5
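If the non-digit characters should be stripped rather than left in place, one option is to remove them with a regular expression before converting. This is a sketch under the assumption that a value like '(714) 254' should become 714254; the sample data below is a shortened copy of the values in the question.

```python
import pandas as pd

raw = pd.Series(['5679558', '(714) 254', '0', '00000000', '000000001'])

# Drop every character that is not a digit (\D), then convert in one vectorised call.
digits_only = raw.str.replace(r'\D', '', regex=True)
numbers = pd.to_numeric(digits_only, errors='coerce')

print(numbers)
```

Rows that contain no digits at all end up as empty strings and are coerced to NaN, which turns the column into floats; in that case a fill value or a nullable integer dtype would be needed to get back to integers.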
You could split your huge dataframe into chunks; for example, this function does it and lets you choose the chunk size:
def splitDataFrameIntoSmaller(df, chunkSize = 10000):
    listOfDf = list()
    # Ceiling division, so there is no empty trailing chunk when len(df) is a multiple of chunkSize
    numberChunks = (len(df) + chunkSize - 1) // chunkSize
    for i in range(numberChunks):
        listOfDf.append(df[i*chunkSize:(i+1)*chunkSize])
    return listOfDf
After you have chunks, you can apply your function on each chunk separately.
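As an illustration of applying a conversion chunk by chunk and stitching the results back together, here is a minimal sketch. `convert_in_chunks` is a hypothetical helper, and `pd.to_numeric` with errors='coerce' stands in for whatever per-chunk function you actually need.

```python
import pandas as pd

def convert_in_chunks(series, chunk_size=10000):
    """Convert a non-empty Series to numbers one slice at a time."""
    pieces = []
    for start in range(0, len(series), chunk_size):
        chunk = series.iloc[start:start + chunk_size]
        # Values that cannot be parsed become NaN instead of raising.
        pieces.append(pd.to_numeric(chunk, errors='coerce'))
    # concat keeps the original index labels and order.
    return pd.concat(pieces)

raw = pd.Series(['5679558', '(714) 254', '0', '000000001'])
converted = convert_in_chunks(raw, chunk_size=2)
print(converted)
```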
I'm trying to parse a logfile of our manufacturing process. Most of the time the process is run automatically but occasionally, the engineer needs to switch into manual mode to make some changes and then switches back to automatic control by the reactor software. When set to manual mode the logfile records the step as being "MAN.OP." instead of a number. Below is a representative example.
steps = [1,2,2,'MAN.OP.','MAN.OP.',2,2,3,3,'MAN.OP.','MAN.OP.',4,4]
ser_orig = pd.Series(steps)
which results in
0 1
1 2
2 2
3 MAN.OP.
4 MAN.OP.
5 2
6 2
7 3
8 3
9 MAN.OP.
10 MAN.OP.
11 4
12 4
dtype: object
I need to detect the 'MAN.OP.' and make them distinct from each other. In this example, the two regions with values == 2 should be one region after detecting the manual mode section like this:
0 1
1 2
2 2
3 Manual_Mode_0
4 Manual_Mode_0
5 2
6 2
7 3
8 3
9 Manual_Mode_1
10 Manual_Mode_1
11 4
12 4
dtype: object
I have code that iterates over this series and produces the correct result when the series is passed to my object. The setter is:
@step_series.setter
def step_series(self, ss):
    """
    On assignment, give the manual mode steps a unique name. Leave
    the steps done on recipe the same.
    """
    manual_mode = "MAN.OP."
    new_manual_mode_text = "Manual_Mode_{}"
    counter = 0
    continuous = False
    for i in ss.index:
        if continuous and ss.at[i] != manual_mode:
            continuous = False
            counter += 1
        elif not continuous and ss.at[i] == manual_mode:
            continuous = True
            ss.at[i] = new_manual_mode_text.format(str(counter))
        elif continuous and ss.at[i] == manual_mode:
            ss.at[i] = new_manual_mode_text.format(str(counter))
    self._step_series = ss
but this iterates over the entire dataframe and is the slowest part of my code other than reading the logfile over the network.
How can I detect these non-unique sections and rename them uniquely without iterating over the entire series? The series is a column selection from a larger dataframe so adding extra columns is fine if needed.
For the completed answer I ended up with:
@step_series.setter
def step_series(self, ss):
    pd.options.mode.chained_assignment = None
    manual_mode = "MAN.OP."
    new_manual_mode_text = "Manual_Mode_{}"
    newManOp = (ss=='MAN.OP.') & (ss != ss.shift())
    ss[ss == 'MAN.OP.'] = 'Manual_Mode_' + (newManOp.cumsum()-1).astype(str)
    self._step_series = ss
Here's one way:
steps = [1,2,2,'MAN.OP.','MAN.OP.',2,2,3,3,'MAN.OP.','MAN.OP.',4,4]
steps = pd.Series(steps)
newManOp = (steps=='MAN.OP.') & (steps != steps.shift())
steps[steps=='MAN.OP.'] += newManOp.cumsum().astype(str)
>>> steps
0 1
1 2
2 2
3 MAN.OP.1
4 MAN.OP.1
5 2
6 2
7 3
8 3
9 MAN.OP.2
10 MAN.OP.2
11 4
12 4
dtype: object
To get the exact format you listed (starting from zero instead of one, and changing from "MAN.OP." to "Manual_Mode_"), just tweak the last line:
steps[steps=='MAN.OP.'] = 'Manual_Mode_' + (newManOp.cumsum()-1).astype(str)
>>> steps
0 1
1 2
2 2
3 Manual_Mode_0
4 Manual_Mode_0
5 2
6 2
7 3
8 3
9 Manual_Mode_1
10 Manual_Mode_1
11 4
12 4
dtype: object
There is a pandas enhancement request for contiguous groupby, which would make this type of task simpler.
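Until something like that exists, a common workaround is to build a run identifier with shift and cumsum and group on it. A minimal sketch, where `run_id` and `run_lengths` are illustrative names:

```python
import pandas as pd

steps = pd.Series([1, 2, 2, 'MAN.OP.', 'MAN.OP.', 2, 2, 3, 3, 'MAN.OP.', 'MAN.OP.', 4, 4])

# A new run starts wherever a value differs from the one before it;
# the cumulative sum of those start flags gives every run its own id.
run_id = (steps != steps.shift()).cumsum()

# Grouping on the run id treats the two separate blocks of 2s
# (and the two manual-mode blocks) as distinct groups.
run_lengths = steps.groupby(run_id).size()
print(run_lengths)
```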
There is a function in matplotlib that takes a boolean array and returns a list of (start, end) pairs. Each pair represents a contiguous region where the input is True. (manual_mode and new_manual_mode_text are the names defined in the question's setter.)
import matplotlib.mlab as mlab
regions = mlab.contiguous_regions(ser_orig == manual_mode)
for i, (start, end) in enumerate(regions):
    ser_orig[start:end] = new_manual_mode_text.format(i)
ser_orig
0 1
1 2
2 2
3 Manual_Mode_0
4 Manual_Mode_0
5 2
6 2
7 3
8 3
9 Manual_Mode_1
10 Manual_Mode_1
11 4
12 4
dtype: object
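The same (start, end) pairs can be computed with NumPy alone if you would rather not depend on matplotlib for this. This is a sketch: `contiguous_true_regions` is a hypothetical helper written for illustration, not a library function.

```python
import numpy as np
import pandas as pd

def contiguous_true_regions(mask):
    """Return (start, end) pairs, end exclusive, for each run of True values."""
    mask = np.asarray(mask, dtype=bool)
    # Pad with False on both sides so runs touching the edges are detected too.
    padded = np.concatenate(([False], mask, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    # Changes alternate between run starts and run ends.
    return list(zip(changes[::2].tolist(), changes[1::2].tolist()))

ser = pd.Series([1, 2, 2, 'MAN.OP.', 'MAN.OP.', 2, 2, 3, 3, 'MAN.OP.', 'MAN.OP.', 4, 4])
regions = contiguous_true_regions((ser == 'MAN.OP.').to_numpy())
for i, (start, end) in enumerate(regions):
    ser.iloc[start:end] = 'Manual_Mode_{}'.format(i)  # positional slice
print(ser)
```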