Build a data frame where each row has incremental numbers in Python

I want to build a data frame with m columns and n rows.
Each row starts with 1 and increments by 1 up to m.
I've tried to find a solution, but I only found one for the columns.
I have also added a figure of a simple case.

Using assign to broadcast the values across the rows of an empty DataFrame:
df = (
    pd.DataFrame(index=range(3))
    .assign(**{f'c{i}': i+1 for i in range(4)})
)
Output:
c0 c1 c2 c3
0 1 2 3 4
1 1 2 3 4
2 1 2 3 4

You can use np.tile:
import numpy as np
m = 4
n = 3
out = pd.DataFrame(np.tile(np.arange(1,m), (n,1)), columns=[f'c{num}' for num in range(m-1)])
Output:
c0 c1 c2
0 1 2 3
1 1 2 3
2 1 2 3
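Note that np.arange(1, m) only produces the values 1 to m-1. If each row should run from 1 up to m, as in the question, the same idea works with the bounds adjusted:
out = pd.DataFrame(np.tile(np.arange(1, m + 1), (n, 1)),
                   columns=[f'c{num}' for num in range(m)])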

Try this (no additional libraries needed):
df = pd.DataFrame({f'c{i}': [i + 1] * n for i in range(m)})
Result with m = 4 and n = 3:
c0 c1 c2 c3
0 1 2 3 4
1 1 2 3 4
2 1 2 3 4

We can just use np.ones:
m = 4  # columns
n = 3  # rows
out = pd.DataFrame(np.ones((n, m)) * (np.arange(m) + 1))
Output:
0 1 2 3
0 1.0 2.0 3.0 4.0
1 1.0 2.0 3.0 4.0
2 1.0 2.0 3.0 4.0
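Since np.ones produces floats, the result can be cast to integers if you prefer (an optional follow-up to the snippet above):
out = out.astype(int)
print(out)
Output:
   0  1  2  3
0  1  2  3  4
1  1  2  3  4
2  1  2  3  4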


Number of character occurrences in a list of strings

I have a dataframe with a list of strings and I would like to add columns with the number of occurrences of each character, sorted from maximum to minimum occurrences.
The dataframe is very big, so I need an efficient way to calculate it.
Original df:
Item
0 ABABCBF
1 ABABCGH
2 ABABEFR
3 ABABFBF
4 ABACTC3
Wanted df:
Item o1 o2 o3 o4 o5
0 ABABCBF 3 2 1 1 null
1 ABABCGH 2 2 1 1 1
2 ABABEFR 2 2 1 1 1
3 ABABFBF 3 2 2 null null
4 ABACTC3 2 2 1 1 1
I have tried using collections.Counter but I am not able to convert the result into the columns of the dataframe:
collections.Counter(df['item'])
Thanks
You can use collections.Counter and the DataFrame constructor:
from collections import Counter

out = df.join(pd.DataFrame(
    sorted(Counter(x).values(), reverse=True)
    for x in df['Item'])
    .rename(columns=lambda x: f'o{x+1}')
)
print(out)
Output:
Item o1 o2 o3 o4 o5
0 ABABCBF 3 2 1 1.0 NaN
1 ABABCGH 2 2 1 1.0 1.0
2 ABABEFR 2 2 1 1.0 1.0
3 ABABFBF 3 2 2 NaN NaN
4 ABACTC3 2 2 1 1.0 1.0
Try:
import json
import pandas as pd
from collections import Counter

df = pd.DataFrame({'Item': ['ABACABDF', 'BACBDFHGAAAA']})

result = df.join(
    pd.DataFrame(
        json.loads(
            df['Item']
            .transform(lambda x: sorted(list(Counter(x).values()), reverse=True))
            .to_json(orient='records')
        )
    )
    .rename(columns=(lambda x: f'o{x+1}'))
)
result
Item o1 o2 o3 o4 o5 o6 o7
0 ABACABDF 3 2 1 1 1 NaN NaN
1 BACBDFHGAAAA 5 2 1 1 1 1.0 1.0
Try this:
def count_chars(txt: str):
    ser = pd.Series([*txt])
    result = ser.value_counts().tolist()
    return result

result = df.join(
    pd.DataFrame([*df['Item'].apply(count_chars)]).rename(columns=lambda x: f'o{x+1}'))
print(result)
>>>
Item o1 o2 o3 o4 o5
0 ABABCBF 3 2 1 1.0 NaN
1 ABABCGH 2 2 1 1.0 1.0
2 ABABEFR 2 2 1 1.0 1.0
3 ABABFBF 3 2 2 NaN NaN
4 ABACTC3 2 2 1 1.0 1.0

Find the time difference between consecutive rows of two columns for a given value in third column

Let's say we want to compute the variable D in the dataframe below based on the time values in variables B and C.
Here, the second row of D is C2 - B1, a difference of 4 minutes; the third row is C3 - B2 = 4 minutes, and so on.
There is no reference value for the first row of D, so it is NA.
Issue:
We also want an NA value for the first row when the category value in variable A changes from 1 to 2. In other words, the value -183 must be replaced by NA.
A B C D
1 5:43:00 5:24:00 NA
1 6:19:00 5:47:00 4
1 6:53:00 6:23:00 4
1 7:29:00 6:55:00 2
1 8:03:00 7:31:00 2
1 8:43:00 8:05:00 2
2 6:07:00 5:40:00 -183
2 6:42:00 6:11:00 4
2 7:15:00 6:45:00 3
2 7:53:00 7:17:00 2
2 8:30:00 7:55:00 2
2 9:07:00 8:32:00 2
2 9:41:00 9:09:00 2
2 10:17:00 9:46:00 5
2 10:52:00 10:20:00 3
You can use:
# Compute delta
df['D'] = (pd.to_timedelta(df['C']).sub(pd.to_timedelta(df['B'].shift()))
           .dt.total_seconds().div(60))
# Fill nan
df.loc[df['A'].ne(df['A'].shift()), 'D'] = np.nan
Output:
>>> df
A B C D
0 1 5:43:00 5:24:00 NaN
1 1 6:19:00 5:47:00 4.0
2 1 6:53:00 6:23:00 4.0
3 1 7:29:00 6:55:00 2.0
4 1 8:03:00 7:31:00 2.0
5 1 8:43:00 8:05:00 2.0
6 2 6:07:00 5:40:00 NaN
7 2 6:42:00 6:11:00 4.0
8 2 7:15:00 6:45:00 3.0
9 2 7:53:00 7:17:00 2.0
10 2 8:30:00 7:55:00 2.0
11 2 9:07:00 8:32:00 2.0
12 2 9:41:00 9:09:00 2.0
13 2 10:17:00 9:46:00 5.0
14 2 10:52:00 10:20:00 3.0
You can use the difference between datetime columns in pandas.
Having
df['B_dt'] = pd.to_datetime(df['B'])
df['C_dt'] = pd.to_datetime(df['C'])
makes the following possible:
>>> df['D'] = (df.groupby('A')
               .apply(lambda s: (s['C_dt'] - s['B_dt'].shift()).dt.seconds / 60)
               .reset_index(drop=True))
You can always drop these new columns later.
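For example, with the helper column names created above, dropping them afterwards could look like this:
df = df.drop(columns=['B_dt', 'C_dt'])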

Replace specific column values in pandas dataframe [duplicate]

This question already has answers here:
pandas replace multiple values one column
(7 answers)
Closed 3 years ago.
df = pd.DataFrame({'Tissues':['a1','x2','y3','b','c1','v2','w3'], 'M':[1,2,'a',4,'b','a',7]})
df.set_index('Tissues')
The dataframe looks like:
M
Tissues
a1 1
x2 2
y3 a
b 4
c1 b
v2 a
w3 7
How can I replace all 'a's in column M with a specific value, say 2, and all 'b's with 3?
I tried:
replace_values = {'a':2, 'b':3}
df['M'] = df['M'].map(replace_values)
but that changed values not present as keys in replace_values to NaN:
Tissues M
0 a1 NaN
1 x2 NaN
2 y3 2.0
3 b NaN
4 c1 3.0
5 v2 2.0
6 w3 NaN
I see that I can do
df.loc[(df['M'] == 'a')] = 2
but can I do this efficiently for a, b and so on, instead of one by one?
Use df.replace:
df = pd.DataFrame({'Tissues':['a1','x2','y3','b','c1','v2','w3'], 'M':[1,2,'a',4,'b','a',7]})
df.set_index('Tissues')
replace_values = {'a':2, 'b':3}
df['M'] = df['M'].replace(replace_values)
Output:
>>> df
Tissues M
0 a1 1
1 x2 2
2 y3 2
3 b 4
4 c1 3
5 v2 2
6 w3 7
Fix your code by adding fillna:
df['M'] = df['M'].map(replace_values).fillna(df.M)
df
Tissues M
0 a1 1.0
1 x2 2.0
2 y3 2.0
3 b 4.0
4 c1 3.0
5 v2 2.0
6 w3 7.0
Use df.replace
replace_values = {'a':2, 'b':3}
df = df.replace({"M": replace_values})
Results:
Tissues M
0 a1 1
1 x2 2
2 y3 2
3 b 4
4 c1 3
5 v2 2
6 w3 7

(Speed calculation) How can I make this simpler?

I have a huge dataframe, hundreds of thousands of rows and columns.
My data looks like this:
df
MAC T_1 X_1 Y_1 T_2 X_2 Y_2 T_3 X_3 Y_3 T_4 X_4 Y_4 T_5 X_5 Y_5 T_6 X_6 Y_6 T_7 X_7 Y_7
ID1 1 1 1 1 1 1 2 1 2 3 1 3 3 1 3 4 1 4 5 1 5
ID2 6 2 5 6 2 5 7 3 5 7 3 5 8 4 5 9 5 5 10 5 4
ID3 1 1 1 2 1 2 3 1 3 3 1 3 4 1 4 5 1 5 6 2 5
I want to calculate the speed using this equation (the straight-line distance between consecutive points divided by the time difference, with zero or negative time differences treated as 0):
V_i = sqrt((X_{i+1} - X_i)^2 + (Y_{i+1} - Y_i)^2) / (T_{i+1} - T_i)
I used this code:
import math
import pandas as pd

df = pd.read_csv("data.csv")
df_result = pd.DataFrame()

def v_2(i):
    return (df.iloc[x, 5+3*(i-1)] - df.iloc[x, 2+3*(i-1)])**2 + (df.iloc[x, 6+3*(i-1)] - df.iloc[x, 3+3*(i-1)])**2

def v(i):
    if (df.iloc[x, 4+3*(i-1)] - df.iloc[x, 1+3*(i-1)]) == 0:
        return 0
    else:
        if (df.iloc[x, 4+3*(i-1)] - df.iloc[x, 1+3*(i-1)]) < 0:
            return 0
        else:
            return math.sqrt(v_2(i)) / (df.iloc[x, 4+3*(i-1)] - df.iloc[x, 1+3*(i-1)])

for i in range(1, int((len(df.columns)-1)/3)):
    v_result = list()
    for x in range(len(df.index)):
        v_2(i)
        v(i)
        v_result.append(v(i))
    df_result[i] = v_result
My expected result:
MAC V1 V2 V3 V4 V5 V6
ID1 0 1 1 0 1 1
ID2 0 1 0 1 1 1
ID3 1 1 0 1 1 1
But this code takes a huge amount of time.
Would you mind suggesting a simpler and faster approach, or one using the multiprocessing module?
Thank you.
The calculation can be sped up quite a bit through reshaping the data first, so that efficient pandas methods can be used. If that is not fast enough, you can then go down to the numpy array and apply the functions there.
First, reshape the data from the wide format to a long format so that there are only 3 columns: T, X, Y. The column suffixes, i.e. _1, _2, etc., are split out into a new index level.
df = df.set_index('MAC')
df.columns = pd.MultiIndex.from_arrays(list(zip(*df.columns.str.split('_'))))
df = df.stack()
this produces the following data frame:
T X Y
MAC
ID1 1 1 1 1
2 1 1 1
3 2 1 2
4 3 1 3
5 3 1 3
6 4 1 4
7 5 1 5
ID2 1 6 2 5
2 6 2 5
3 7 3 5
4 7 3 5
5 8 4 5
6 9 5 5
7 10 5 4
ID3 1 1 1 1
2 2 1 2
3 3 1 3
4 3 1 3
5 4 1 4
6 5 1 5
7 6 2 5
Next, calculate del_X^2, del_Y^2 and del_t (I hope the use of the prefix del, for delta, is unambiguous). This is more easily done using these two utility functions to avoid repetition.
def f(x):
    return x.shift(-1) - x

def f2(x):
    return f(x)**2
Update: description of the functions
The first function calculates F(W, n) = W(n+1) - W(n) for all n, where n is the index of the array W. The second function squares its argument. These functions are composed to calculate the squared distance. See the documentation for pd.Series.shift for more information and examples.
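As a quick illustration of what f does on a small Series (the values here are arbitrary):
s = pd.Series([1, 3, 6])
f(s)          # s.shift(-1) - s, i.e. the difference to the next element
# 0    2.0
# 1    3.0
# 2    NaN
# dtype: float64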
Using lower-case column names for the del-prefixed quantities above, with the suffix 2 meaning squared:
df['x2'] = df.groupby(level=0).X.transform(f2)
df['y2'] = df.groupby(level=0).Y.transform(f2)
df['t'] = df.groupby(level=0)['T'].transform(f)
df['v'] = np.sqrt(df.x2 + df.y2) / df.t
df.v.unstack(0)
produces the following which is similar to your output, but transposed.
MAC ID1 ID2 ID3
1 NaN NaN 1.0
2 1.0 1.0 1.0
3 1.0 NaN NaN
4 NaN 1.0 1.0
5 1.0 1.0 1.0
6 1.0 1.0 1.0
7 NaN NaN NaN
You can filter out the last row of each group (where the computed columns t, x2 and y2 are null), fill the np.nan in v with 0, transpose, rename the columns and reset the index to get your desired result.
result = df[pd.notnull(df.t)].v.unstack(0).fillna(0).T
result.columns = ['V'+x for x in result.columns]
result.reset_index()
# outputs:
MAC V1 V2 V3 V4 V5 V6
0 ID1 0.0 1.0 1.0 0.0 1.0 1.0
1 ID2 0.0 1.0 0.0 1.0 1.0 1.0
2 ID3 1.0 1.0 0.0 1.0 1.0 1.0
I suggest you use Apache Spark if you want real speed.
You can do that by passing your function to Spark, as described in this documentation:
Passing function to Spark

Pandas+Python - How to know when a value changes?

I've been working on a DataFrame like the following extract, and I want to know when the value changes:
A M C
0 2.0 1 C1
1 2.0 1 C1
2 2.0 2 C1
3 2.0 2 C1
4 2.0 3 C1
5 2.0 3 C1
6 2.0 1 C2
7 2.0 1 C2
8 2.0 2 C2
9 2.0 2 C3
10 2.0 3 C3
11 2.0 3 C3
13 2.1 1 C3
14 2.1 1 C3
15 2.1 2 C3
16 2.1 2 C3
17 2.1 3 C3
18 2.1 3 C3
I know that A or C always changes when M starts at 1. The question is: how can I get the position every time the M value starts at 1?
I don't know if your entire data set is built the same way as the one you are showing us, but from what I can see you are searching for the transition from 3 to 1 in the M column, which would result in a difference of -2:
df[df['M'].diff()==-2].index
Out[101]: Int64Index([6, 13], dtype='int64')
Let's say your M column always increases but can go higher than 3; then you could just look for the occurrence of a negative difference, such as:
df[df['M'].diff()<0].index
Out[103]: Int64Index([6, 13], dtype='int64')
Let's say there is no pattern there; then you could simply do:
df[(df['M'].diff()!=0) & (df['M']==1)].index
Out[104]: Int64Index([0, 6, 13], dtype='int64')
This adds index 0 because .diff() returns NaN for the first index of the dataframe (and NaN != 0 evaluates to True), while df['M'] == 1 there.
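If you don't want index 0 included, one option (just a sketch of the same idea) is to fill the initial NaN before comparing:
df[(df['M'].diff().fillna(0) != 0) & (df['M'] == 1)].index
# Int64Index([6, 13], dtype='int64')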
Another way to determine when a new M set starts is to find where M is 1 and the previous M isn't:
In [18]: (df['M'] == 1) & (df["M"].shift() != 1)
Out[18]:
0 True
1 False
2 False
3 False
4 False
5 False
6 True
7 False
[.. and so on]
Name: M, dtype: bool
This includes the first element, but often makes sense. Once you have this, you can take its cumulative sum to get a group number associated with each group (because True == 1 and False == 0):
In [19]: df["group_index"] = ((df['M'] == 1) & (df["M"].shift() != 1)).cumsum()
In [20]: df
Out[20]:
A M C group_index
0 2.0 1 C1 1
1 2.0 1 C1 1
2 2.0 2 C1 1
3 2.0 2 C1 1
4 2.0 3 C1 1
5 2.0 3 C1 1
6 2.0 1 C2 2
7 2.0 1 C2 2
[.. and so on]
which is convenient because then you can use groupby to perform operations on the different clusters.
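For instance, a minimal sketch of what that group number enables (the particular aggregations are just examples):
df.groupby("group_index").size()       # number of rows in each M-cycle
df.groupby("group_index")["M"].max()   # largest M reached in each cycle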
