Pandas DataFrame column concatenation - python

I have a pandas DataFrame y with about 1 million rows and 5 columns.
np.shape(y)
(1037889, 5)
The column values are all 0 or 1. Looks something like this:
y.head()
a, b, c, d, e
0, 0, 1, 0, 0
1, 0, 0, 1, 1
0, 1, 1, 1, 1
0, 0, 0, 0, 0
I want a DataFrame with 1 million rows and a single column.
np.shape(y)
(1037889, )
where the column is just the 5 columns concatenated together.
New column
0, 0, 1, 0, 0
1, 0, 0, 1, 1
0, 1, 1, 1, 1
0, 0, 0, 0, 0
I keep trying different things like merge, concat, dstack, etc., but can't seem to figure this out.

If you want the new column to contain all the data concatenated into a string, it's a good case for the apply() function:
>>> df = pd.DataFrame({'a':[0,1,0,0], 'b':[0,0,1,0], 'c':[0,1,1,0], 'd':[0,1,1,0]})
>>> df
a b c d
0 0 0 0 0
1 1 0 1 1
2 0 1 1 1
3 0 0 0 0
>>> df2 = df.apply(lambda row: ','.join(map(str, row)), axis=1)
>>> df2
0 0,0,0,0
1 1,0,1,1
2 0,1,1,1
3 0,0,0,0
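If you'd rather skip the explicit lambda, the same string concatenation can be written with astype and agg; a minimal sketch, assuming the df defined above:
>>> df2 = df.astype(str).agg(','.join, axis=1)  # cast to str, join each row with commas
Either way the result is a Series of shape (len(df),), which matches the (1037889,) shape asked for.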

Related

Insert value in numpy array with conditions

I want to fill values in a NumPy array as follows:
If the Nth row is the same as the (N-1)th row, insert 1 for both rows in the current column and 0 everywhere else.
If the Nth row is different from the (N-1)th row, move to the next column and repeat the condition.
Here is the example
d = {'col1': [2, 2, 3, 3, 3, 4, 4, 5, 5, 5],
     'col2': [3, 3, 4, 4, 4, 1, 1, 0, 0, 0]}
df = pd.DataFrame(data=d)
np.zeros((10, 4))  # desired output shape
###########################################################
OUTPUT MATRIX
1 0 0 0    first two rows are the same, so 1,1 in the first column
1 0 0 0
0 1 0 0    next three rows are the same, so 1,1,1 in the second column
0 1 0 0
0 1 0 0
0 0 1 0    again two rows are the same, so 1,1
0 0 1 0
0 0 0 1    again three rows are the same, so 1,1,1
0 0 0 1
0 0 0 1
IIUC, you can achieve this simply with numpy indexing:
# group by successive identical values
group = df.ne(df.shift()).all(1).cumsum().sub(1)
# craft the numpy array
a = np.zeros((len(group), group.max()+1), dtype=int)
a[np.arange(len(df)), group] = 1
print(a)
Alternative with numpy.identity:
# group by successive identical values
group = df.ne(df.shift()).all(1).cumsum().sub(1)
shape = df.groupby(group).size()
# craft the numpy array
a = np.repeat(np.identity(len(shape), dtype=int), shape, axis=0)
print(a)
output:
array([[1, 0, 0, 0],
[1, 0, 0, 0],
[0, 1, 0, 0],
[0, 1, 0, 0],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 1, 0],
[0, 0, 0, 1],
[0, 0, 0, 1],
[0, 0, 0, 1]])
intermediates:
group
0 0
1 0
2 1
3 1
4 1
5 2
6 2
7 3
8 3
9 3
dtype: int64
shape
0 2
1 3
2 2
3 3
dtype: int64
other option
For fun, though likely not as efficient on large inputs:
a = pd.get_dummies(df.agg(tuple, axis=1)).to_numpy()
Note that this second option uses groups of identical values, not successive identical values. For identical values with the first (numpy) approach, you would need to use group = df.groupby(list(df)).ngroup() and the numpy indexing option (this wouldn't work with repeating the identity).
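For completeness, a minimal sketch of that variant, grouping identical rows wherever they occur rather than only consecutive runs, using the same df as above:
# group by identical rows anywhere in the frame, not just successive runs
group = df.groupby(list(df)).ngroup()
a = np.zeros((len(group), group.max() + 1), dtype=int)
a[np.arange(len(df)), group] = 1
For this particular df the result is the same, since each pair of values only appears in one consecutive block.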

Python: How to compare values of a row with a threshold to determine cycles

I have the following code, which reads data from a machine in CSV format:
import pandas as pd
import numpy as np

header_list = ['Time']
df = pd.read_csv('S8-1.csv', skiprows=6, names=header_list)
# split the data into proper columns
df[['Date/Time', 'Pressure']] = df.Time.str.split(",,", expand=True)
# delete the original messy column
df.pop('Time')
# convert Pressure from object to numeric
df['Pressure'] = pd.to_numeric(df['Pressure'], errors='coerce')
# convert to a datetime
df['Date/Time'] = pd.to_datetime(df['Date/Time'], format='%m/%d/%y %H:%M:%S.%f', errors='coerce')
# calculate rolling and centered-rolling averages of the pressure values
df['Moving Average'] = df['Pressure'].rolling(window=5).mean()
df['Rolling Average Center'] = df['Pressure'].rolling(window=5, center=True).mean()
# threshold for the machine being on or off: if the centered rolling average
# is greater than 115 psi, the machine is considered on
df['Machine On/Off'] = ['1' if x >= 115 else '0' for x in df['Rolling Average Center']]
df
The resulting DataFrame has a 'Machine On/Off' column whose rows contain 1 or 0 based on the threshold I set. I need to write code that goes through these rows and indicates whether a cycle has started. The problem is that the data is slightly noisy: during an 'on' cycle there will be around 20 rows saying 1, with a couple of rows saying 0 due to poor data received.
I need code that compares the values in the data to determine how many cycles the machine is on or off for. I was thinking that a threshold of around 6 rows would work: if the value is 1 for more than 6 rows, it indicates a cycle, and the stray 0's scattered throughout the column are ignored.
What would be the best way to program this, so I can get a total count of the cycles the machine is on or off for throughout the 20,000 rows of data I have?
Edit: Here is a similar example DataFrame. In it there are 3 on-cycles (runs of 1 values), and mixed into those cycles are 0 values (bad data). I need code that counts the total number of cycles while ignoring the bad data that may sit in the middle of an on-cycle.
import pandas as pd
Machine = [0,0,0,0,0,0,1,1,1,1,1,0,1,1,1,0,0,0,0,0,0,0,1,1,1,0,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0]
df2 = pd.DataFrame(Machine)
You can create groups of consecutive rows of on/off using cumsum:
machine = [0,0,0,0,0,0,1,1,1,1,1,0,1,1,1,0,0,0,0,0,0,0,1,1,1,0,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0]
df = pd.DataFrame(machine, columns=['Machine On/Off'])
df['group'] = df['Machine On/Off'].ne(df['Machine On/Off'].shift()).cumsum()
df['group_size'] = df.groupby('group')['group'].transform('size')
# Output
Machine On/Off group group_size
0 0 1 6
1 0 1 6
2 0 1 6
3 0 1 6
4 0 1 6
5 0 1 6
6 1 2 5
7 1 2 5
8 1 2 5
9 1 2 5
10 1 2 5
I'm not sure I got your intention on how you would like to filter/alter the values, but this can probably serve as a guide:
threshold = 6
# Replace 0 with 1 where group_size < threshold. Note this invalidates the
# existing group/group_size columns.
df.loc[(df['Machine On/Off'].eq(0)) & (df.group_size.lt(threshold)), 'Machine On/Off'] = 1
# Output df['Machine On/Off'].values
array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], dtype=int64)
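To get the actual cycle count asked for, one option (a sketch, assuming the cleaned 'Machine On/Off' column above) is to count the rising edges, i.e. the 0 -> 1 transitions, after the short 0-gaps have been filled in:
on = df['Machine On/Off'].eq(1)
# a rising edge is a 1 whose previous row was not 1; each edge starts one on-cycle
n_on_cycles = (on & ~on.shift(fill_value=False)).sum()
print(n_on_cycles)  # 3 for the example data
The count of off-cycles follows the same way with eq(0).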

How to create new DataFrame based on conditions from another DataFrame

Suppose I have the following DataFrame:
A B C D E F Cost
0 1 1 0 0 0 10
0 0 1 0 1 0 3
1 0 0 0 0 1 5
0 1 0 1 0 0 7
I want to construct a new DataFrame based on the values above.
Specifically, for each row, combine the names of the columns whose value is 1 into a single column label, and assign that new column its value from the Cost column above.
So the expected output would be something like:
BC CE AF BD
10 3 5 7
How can I achieve such thing?
We can take the dot product of the binary columns with the column names to get the key string based on the 1s and 0s, then add the Cost column back:
cols = df.columns.difference(['Cost'])
new_df = df[cols].dot(cols).to_frame(name='key')
new_df['Cost'] = df['Cost']
new_df:
key Cost
0 BC 10
1 CE 3
2 AF 5
3 BD 7
The DataFrame can be transposed if needed:
cols = df.columns.difference(['Cost'])
new_df = df[cols].dot(cols).to_frame(name='key')
new_df['Cost'] = df['Cost']
new_df = new_df.set_index('key').T.rename_axis(columns=None)
new_df:
BC CE AF BD
Cost 10 3 5 7
DataFrame and imports:
import pandas as pd

df = pd.DataFrame({
    "A": [0, 0, 1, 0],
    "B": [1, 0, 0, 1],
    "C": [1, 1, 0, 0],
    "D": [0, 0, 0, 1],
    "E": [0, 1, 0, 0],
    "F": [0, 0, 1, 0],
    "Cost": [10, 3, 5, 7],
})
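Why the dot trick works: multiplying a 0/1 entry by a column name keeps or drops that name (1 * 'B' is 'B', 0 * 'A' is the empty string), and the summation inside the dot product concatenates the survivors. A minimal illustration:
# one row against its column labels: 0*'A' + 1*'B' + 1*'C' + 0*'D' -> 'BC'
demo = pd.DataFrame([[0, 1, 1, 0]], columns=list("ABCD"))
print(demo.dot(demo.columns))  # 0    BC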
You don't need a loop to do it. With datar, you can achieve it with dplyr-like syntax:
>>> from datar.all import *
>>>
>>> # Create the df
>>> df = tribble(
... f.A, f.B, f.C, f.D, f.E, f.F, f.Cost,
... 0, 1, 1, 0, 0, 0, 10,
... 0, 0, 1, 0, 1, 0, 3,
... 1, 0, 0, 0, 0, 1, 5,
... 0, 1, 0, 1, 0, 0, 7,
... )
>>> df
A B C D E F Cost
<int64> <int64> <int64> <int64> <int64> <int64> <int64>
0 0 1 1 0 0 0 10
1 0 0 1 0 1 0 3
2 1 0 0 0 0 1 5
3 0 1 0 1 0 0 7
>>> # replace value with column names
>>> df = df >> mutate(across(f[1:6], lambda x: if_else(x, x.name, "")))
>>> df
A B C D E F Cost
<object> <object> <object> <object> <object> <object> <int64>
0 B C 10
1 C E 3
2 A F 5
3 B D 7
>>> # unite the columns
>>> df = df >> unite('col', f[1:6], sep="")
>>> df
col Cost
<object> <int64>
0 BC 10
1 CE 3
2 AF 5
3 BD 7
>>> # reshape the result
>>> df >> column_to_rownames(f.col) >> t()
BC CE AF BD
<int64> <int64> <int64> <int64>
Cost 10 3 5 7
Disclaimer: I am the author of the datar package.
Here is how I would proceed:
Create the df
data = {
    "A": [0, 0, 1, 0],
    "B": [1, 0, 0, 1],
    "C": [1, 1, 0, 0],
    "D": [0, 0, 0, 1],
    "E": [0, 1, 0, 0],
    "F": [0, 0, 1, 0],
    "Cost": [10, 3, 5, 7],
}
df = pd.DataFrame(data)
Get the column names
def make_df(row):
    row = row.to_dict()
    return "".join([k for k, v in row.items() if v and k != "Cost"])

df_ind = df.apply(make_df, axis=1)
Create the desired data frame
pd.DataFrame(df.Cost.values, index=df_ind.values).T
This will give you:
BC CE AF BD
10 3 5 7
Not as nice as the previous answers, but straightforward and step-by-step:
outp_dict = {}
for index, row in df.iterrows():
    new_col = ""
    for col_nr, value in enumerate(row):
        # append the column name when the value is truthy (1)
        if value and row.index[col_nr] != "Cost":
            new_col += str(row.index[col_nr])
    outp_dict[new_col] = row["Cost"]
outp_df = pd.DataFrame(outp_dict, index=[0])

Calculate the average of sections of a column with condition met to create new dataframe

I have the below data table
A = [2, 3, 1, 2, 4, 1, 5, 3, 1, 7, 5]
B = [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
df = pd.DataFrame({'A':A, 'B':B})
I'd like to calculate the average of column A over each run of consecutive rows where column B equals 1. All rows where column B equals 0 are ignored, and the run averages go into a new dataframe (2.0 for the first run of 1s, 3.0 for the second).
Thanks for your help!
Keywords: groupby, shift, mean
Code:
df_result = df.groupby((df['B'].shift(1, fill_value=0) != df['B']).cumsum()).mean()
df_result = df_result[df_result['B'] != 0]
df_result
A B
1 2.0 1.0
3 3.0 1.0
As you might have noticed, you first need to identify the blocks of consecutive rows sharing the same B value.
One way to do so is to shift B by one row and then compare it with itself:
df['B_shifted'] = df['B'].shift(1, fill_value=0)  # fill_value=0 to return int and replace NaN with 0

df['A']                      = [2, 3, 1, 2, 4, 1, 5, 3, 1, 7, 5]
df['B']                      = [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
df['B_shifted']              = [0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0]
(df['B_shifted'] != df['B']) = [F, T, F, F, T, F, T, F, F, T, F]
                                   ↑        ↑     ↑        ↑    (block boundaries)
Now we can use the pandas groupby method as follows:
df_grouped = df.groupby((df['B_shifted'] != df['B']).cumsum())
If we loop over the DataFrameGroupBy object df_grouped, we'll see the following tuples:
(0, A B B_shifted
0 2 0 0)
(1, A B B_shifted
1 3 1 0
2 1 1 1
3 2 1 1)
(2, A B B_shifted
4 4 0 1
5 1 0 0)
(3, A B B_shifted
6 5 1 0
7 3 1 1
8 1 1 1)
(4, A B B_shifted
9 7 0 1
10 5 0 0)
We can now simply calculate the mean and filter out the zero groups as follows:
df_result = df_grouped.mean()
df_result = df_result[df_result['B'] != 0][['A', 'B']]
Try:
m = (df.B != df.B.shift(1)).cumsum() * df.B
df_out = df.groupby(m[m > 0])["A"].mean().reset_index(drop=True).to_frame()
df_out["B"] = 1
print(df_out)
Prints:
     A  B
0  2.0  1
1  3.0  1
df1 = df.groupby((df['B'].shift() != df['B']).cumsum()).mean().reset_index(drop=True)
df1 = df1[df1['B'] == 1].astype(int).reset_index(drop=True)
df1
Output
A B
0 2 1
1 3 1
Explanation
We check whether each row's value of B differs from the previous row's value using shift; the cumulative sum of those changes labels each run of consecutive values. We group by those labels, take the mean, and assign the result to the new dataframe df1.
Since this yields the mean of every run of consecutive 0s and 1s, we then keep only the rows where B == 1.
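The same result can also be had by filtering to the B == 1 rows before averaging; a sketch using the same run labels as the answers above:
# label runs of consecutive B values, then average A over the B == 1 runs only
runs = df['B'].ne(df['B'].shift()).cumsum()
result = df.loc[df['B'].eq(1)].groupby(runs)['A'].mean().reset_index(drop=True)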

numpy/pandas: How to convert a series of strings of zeros and ones into a matrix

I have a data that arrives in this format:
[
(1, "000010101001010101011101010101110101", "aaa", ... ),
(0, "111101010100101010101110101010111010", "bb", ... ),
(0, "100010110100010101001010101011101010", "ccc", ... ),
(1, "000010101001010101011101010101110101", "ddd", ... ),
(1, "110100010101001010101011101010111101", "eeee", ... ),
...
]
In tuple format, it looks like this:
(Y, X, other_info, ... )
At the end of the day, I need to train a classifier (e.g. sklearn.linear_model.logistic.LogisticRegression) using Y and X.
What's the most straightforward way to turn the string of ones and zeros into something like a np.array, so that I can run it through the classifier? Seems like there should be an easy answer here, but I haven't been able to think of/google one.
A few notes:
I'm already using numpy/pandas/sklearn, so anything in those libraries is fair game.
For a lot of what I'm doing, it's convenient to have the other_info columns together in a DataFrame
The strings are pretty long (~20,000 columns), but the total data frame is not very tall (~500 rows).
Since you asked primarily for a way to convert a string of ones and zeros into a numpy array, I'll offer my solution as follows:
d = '0101010000' * 2000  # create a 20,000-character string of 1s and 0s
d_array = np.frombuffer(d.encode('ascii'), dtype=np.int8) - 48  # 48 is ASCII '0'; '1' is 49
(On older NumPy this was spelled np.fromstring(d, 'int8') - 48; fromstring is deprecated in favour of frombuffer.)
This compares favourably to @DSM's solution in terms of speed:
In [21]: timeit numpy.fromstring(d, dtype='int8') - 48
10000 loops, best of 3: 35.8 us per loop
In [22]: timeit numpy.fromiter(d, dtype='int', count=20000)
100 loops, best of 3: 8.57 ms per loop
How about something like this:
Make the dataframe:
In [82]: v = [
....: (1, "000010101001010101011101010101110101", "aaa"),
....: (0, "111101010100101010101110101010111010", "bb"),
....: (0, "100010110100010101001010101011101010", "ccc"),
....: (1, "000010101001010101011101010101110101", "ddd"),
....: (1, "110100010101001010101011101010111101", "eeee"),
....: ]
In [83]: df = pandas.DataFrame(v)
We can use fromiter or array to get an ndarray:
In [84]: d ="000010101001010101011101010101110101"
In [85]: np.fromiter(d, int) # better: np.fromiter(d, int, count=len(d))
Out[85]:
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0,
1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1])
In [86]: np.array(list(d), int)
Out[86]:
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0,
1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1])
There might be a slick vectorized way to do this, but I'd just apply the obvious per-entry function to the values and get on with my day:
In [87]: df[1]
Out[87]:
0 000010101001010101011101010101110101
1 111101010100101010101110101010111010
2 100010110100010101001010101011101010
3 000010101001010101011101010101110101
4 110100010101001010101011101010111101
Name: 1
In [88]: df[1] = df[1].apply(lambda x: np.fromiter(x, int)) # better with count=len(x)
In [89]: df
Out[89]:
0 1 2
0 1 [0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 1 1 0 1 aaa
1 0 [1 1 1 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 1 1 0 bb
2 0 [1 0 0 0 1 0 1 1 0 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 ccc
3 1 [0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 1 1 0 1 ddd
4 1 [1 1 0 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 1 eeee
In [90]: df[1][0]
Out[90]:
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0,
1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1])
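Putting the pieces together for the classifier step, a sketch assuming the tuples live in a list v as above (the column names 'y', 'bits', 'other' are made up for illustration):
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame(v, columns=['y', 'bits', 'other'])
# expand each bit-string into a row of 0/1 ints; frombuffer over the ASCII
# bytes is the fast path (48 is the ASCII code of '0')
X = np.vstack([np.frombuffer(s.encode('ascii'), dtype=np.int8) - 48
               for s in df['bits']])
clf = LogisticRegression().fit(X, df['y'])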
