How to detect when a value changes for each object in pandas - Python

180762508,1268510763,374723980,293,20180402035748,198,25,1,1
180762508,1268503685,374717256,307,20180402035758,225,38,1,1
180762508,1268492506,374708540,236,20180402035808,222,52,1,1
180762508,1268485868,374697563,248,20180402035818,197,47,1,1
180762508,1268482430,374688520,272,20180402035828,196,31,1,1
180707764,1270608366,374988433,246,20180402035925,66,37,1,0
180707764,1270620899,374992366,222,20180402035935,68,49,1,0
The first column is a unique id and the last column is the one I'm interested in.
I want to know how I can find where the last column changes from 0 to 1.
I made one really big data frame from this dataset in pandas:
import glob
import pandas as pd

path = r"1\1"
allFiles = glob.glob(path + r"\*.DAT")
frames = []  # renamed from `list`, which shadows the built-in
for filename in allFiles:
    df = pd.read_csv(filename, header=None)
    frames.append(df)
a = pd.concat(frames)
a.head()
This is all I did.
I get no error, but I want to know the algorithm for finding where the last column's value changes within each unique id.
My goal is to make a data frame whose first column is the unique id, whose second and third columns are the latitude and longitude (the third and second columns of my dataset), and which has the timestamp (the 5th column) of the row where the last column's value changes from 0 to 1.

If I understood you, you need to get the row where the change from 0 to 1 in the last column takes place (row 5 in your sample).
I made a dataframe with your first and last columns (by the way, you said the 1st column is some kind of unique id, but I see repeated numbers). Based on your sample data, one possible solution is:
import pandas as pd

data = [[180762508, 1], [180762508, 1], [180762508, 1], [180762508, 1],
        [180762508, 1], [180707764, 0], [180707764, 0]]
df = pd.DataFrame(data, columns=['my_id', 'interest'])
# new dataframe to compare the column interest
df2 = df.loc[df['interest'] != df['interest'].shift(-1)]
# output:
#        my_id  interest
# 4  180762508         1
# 6  180707764         0
imax = df2.index.max()  # index after the change
imin = df2.index.min()  # index before the change
for i in range(imin, imax):
    # the row with the change in the original dataframe
    print(df.loc[i])
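Since the question asks for the change within each unique id, here is a complementary sketch (not part of the answer above) that scopes the shift comparison per id with groupby; the sample values and column names are assumptions:
import pandas as pd

df = pd.DataFrame({
    'my_id':    [180707764, 180707764, 180707764, 180762508, 180762508],
    'interest': [0, 0, 1, 1, 1],
})
# previous interest value within each id (NaN at each id's first row)
prev = df.groupby('my_id')['interest'].shift(1)
# rows where interest changed from 0 to 1
changes = df[(prev == 0) & (df['interest'] == 1)]
print(changes)
#        my_id  interest
# 2  180707764         1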

Hi and thanks for posting. It looks like the first column doesn't have unique values, so I'm guessing you want the index or the timestamp returned?
In any case, here's a sample of what might work for you if you want to find when the interest column for an ID changes from 0 to 1:
import pandas as pd

# Provided data
raw_str = """
180762508,1268510763,374723980,293,20180402035748,198,25,1,1
180762508,1268503685,374717256,307,20180402035758,225,38,1,1
180762508,1268492506,374708540,236,20180402035808,222,52,1,1
180762508,1268485868,374697563,248,20180402035818,197,47,1,1
180762508,1268482430,374688520,272,20180402035828,196,31,1,1
180707764,1270608366,374988433,246,20180402035925,66,37,1,0
180707764,1270620899,374992366,222,20180402035935,68,49,1,0
"""
# Split on whitespace (newlines included) to get one string per row
chunks = raw_str.split()
# Create a simple dictionary for the ID, timestamp, and interest columns
ddict = {}
ddict['id'] = [i.split(',')[0] for i in chunks]
ddict['timestamp'] = [i.split(',')[4] for i in chunks]
ddict['interest'] = [i.split(',')[-1] for i in chunks]
# Convert the dictionary to a pandas DataFrame
df = pd.DataFrame(ddict)
# Create a dictionary for one extra sample row:
# an existing ID with a timestamp in the future and 1 as interest
tdict = {
    'id': '180707764',
    'timestamp': '20180402035945',
    'interest': '1',
}
What df will look like after the append and sort below:
id timestamp interest
0 180707764 20180402035925 0
1 180707764 20180402035935 0
2 180707764 20180402035945 1
3 180762508 20180402035748 1
4 180762508 20180402035758 1
5 180762508 20180402035808 1
6 180762508 20180402035818 1
7 180762508 20180402035828 1
Continuing on:
# Append that dictionary to your dataframe and sort by id, timestamp
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used here)
df = pd.concat([df, pd.DataFrame([tdict])], ignore_index=True)
df = df.sort_values(['id', 'timestamp']).reset_index(drop=True)
# Shift the dataframe back 1 period by rows
df2 = df.shift(periods=-1, axis=0)
# Merge that dataframe with our original dataframe by index values
# We're dropping the extra id column and renaming our primary id column for aesthetics
df3 = df.merge(df2, left_index=True, right_index=True, suffixes=('_prev', '_curr'))
df3 = df3.drop('id_curr', axis=1).rename(columns={'id_prev': 'id'})
What df3 looks like:
id timestamp_prev interest_prev timestamp_curr interest_curr
0 180707764 20180402035925 0 20180402035935 0
1 180707764 20180402035935 0 20180402035945 1
2 180707764 20180402035945 1 20180402035748 1
3 180762508 20180402035748 1 20180402035758 1
4 180762508 20180402035758 1 20180402035808 1
5 180762508 20180402035808 1 20180402035818 1
6 180762508 20180402035818 1 20180402035828 1
7 180762508 20180402035828 1 NaN NaN
Now we can just create a conditional to return the row where interest changed from 0 to 1:
In[0]: df3[(df3['interest_prev'] == '0') & (df3['interest_curr'] == '1')]
Which returns:
          id  timestamp_prev interest_prev  timestamp_curr interest_curr
1  180707764  20180402035935             0  20180402035945             1
You can also return specific columns by adding them onto the end of the result set:
df3[(df3['interest_prev'] == '0') & (df3['interest_curr'] == '1')]['timestamp_curr']
df3[(df3['interest_prev'] == '0') & (df3['interest_curr'] == '1')][['id', 'timestamp_curr']]
Or use the original dataframe (df) and .iloc to get specified data:
df.iloc[df3[(df3['interest_prev'] == '0') & (df3['interest_curr'] == '1')].index, :]
Out:
id timestamp interest
1 180707764 20180402035935 0
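As an aside, the merge step can be collapsed: DataFrame.join with add_suffix builds the same paired view in one line. A sketch using the df defined above (note it keeps both suffixed id columns rather than renaming one):
df3 = df.add_suffix('_prev').join(df.shift(-1).add_suffix('_curr'))
mask = (df3['interest_prev'] == '0') & (df3['interest_curr'] == '1')
print(df3.loc[mask])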

Related

How to Pivot/Stack for multi header column dataframe

import numpy as np
import pandas as pd

np.random.seed(2022)  # added to make the data the same each time
cols = pd.MultiIndex.from_arrays([['A', 'A', 'B', 'B'], ['min', 'max', 'min', 'max']])
df = pd.DataFrame(np.random.rand(3, 4), columns=cols)
df.index.name = 'item'
A B
min max min max
item
0 0.009359 0.499058 0.113384 0.049974
1 0.685408 0.486988 0.897657 0.647452
2 0.896963 0.721135 0.831353 0.827568
There are two column headers, and when working with CSV I get a blank column name for every other column after unmerging.
I want a result that looks like this. How can I do it?
I tried to use a pivot table but couldn't do it.
Try:
df = (
    df.stack(level=0)
    .reset_index()
    .rename(columns={"level_1": "title"})
    .sort_values(by=["title", "item"])
)
print(df)
Prints:
item title max min
0 0 A 0.762221 0.737758
2 1 A 0.930523 0.275314
4 2 A 0.746246 0.123621
1 0 B 0.044137 0.264969
3 1 B 0.577637 0.699877
5 2 B 0.601034 0.706978
Then to CSV:
df.to_csv('out.csv', index=False)
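On the CSV issue mentioned in the question (a blank column name for every other column): the two header rows survive a round trip if read_csv is told about both of them. A small sketch, with 'multi.csv' as an assumed filename:
import numpy as np
import pandas as pd

np.random.seed(2022)
cols = pd.MultiIndex.from_arrays([['A', 'A', 'B', 'B'], ['min', 'max', 'min', 'max']])
wide = pd.DataFrame(np.random.rand(3, 4), columns=cols)
wide.index.name = 'item'
wide.to_csv('multi.csv')  # writes both header rows
back = pd.read_csv('multi.csv', header=[0, 1], index_col=0)  # restores the MultiIndex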

How to create a dataframe with a named-index and a unnamed-default-subindex

I want to create a dataframe with an index of dates, but one date may have one record or more.
So I want to create a dataframe like:
A B
2021-11-12 1 0 0
2 1 1
2021-11-13 1 0 0
2 1 0
3 0 1
Could I append any row with the same date into this dataframe and have the subindex auto-increase?
Or is there any other way to save records with the same date index in one dataframe?
Use:
#remove counter level
df = df.reset_index(level=1, drop=True)
#add new row
#your code
# move the new row into datetime order
df = df.sort_index()
#add subindex
df = df.set_index(df.groupby(level=0).cumcount().add(1), append=True)
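A runnable sketch of the recipe above, with column names A/B and a string date index assumed for illustration:
import pandas as pd

idx = pd.MultiIndex.from_tuples([('2021-11-12', 1), ('2021-11-12', 2)])
df = pd.DataFrame({'A': [0, 1], 'B': [0, 1]}, index=idx)
# remove the counter level
df = df.reset_index(level=1, drop=True)
# add a new row for another date
df.loc['2021-11-13'] = [0, 0]
# keep rows ordered by date
df = df.sort_index()
# rebuild the auto-increasing subindex
df = df.set_index(df.groupby(level=0).cumcount().add(1), append=True)
print(df)
#               A  B
# 2021-11-12 1  0  0
#            2  1  1
# 2021-11-13 1  0  0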

How to hide axis labels in python pandas dataframe?

I've used the following code to generate a dataframe which is supposed to be the input for a seaborn plot.
import numpy as np
import pandas as pd

data_array = np.array([['index', 'value']])
for x in range(len(value_list)):
    data_array = np.append(data_array, np.array([[int(x + 1), int(value_list[x])]]), axis=0)
data_frame = pd.DataFrame(data_array)
The output looks something like this:
0 1
0 index values
1 1 value_1
2 2 value_2
3 3 value_3
However, with this dataframe, seaborn returns an error. When comparing my data to the examples, I see that the first row is missing. The samples, being loaded in with load_dataset(), look something like this:
0 index values
1 1 value_1
2 2 value_2
3 3 value_3
How do I remove the first row of axis labels of my dataframe so that it looks like the samples provided? Removing the first row removes the strings "index" and "values", but not the axis label.
Numpy treats the index/values row as another row of dataframe values rather than as the column names.
I think this would be a more pythonic way of doing it:
pd.DataFrame(list(enumerate(value_list, 1)), columns=['index', 'values'])
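For instance, with a small assumed value_list this produces proper column names instead of a header data row:
import pandas as pd

value_list = ['10', '20', '30']  # assumed sample values
print(pd.DataFrame(list(enumerate(value_list, 1)), columns=['index', 'values']))
#    index values
# 0      1     10
# 1      2     20
# 2      3     30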
Don't know what value_list is. However I would recommend another way to create dataframe:
import pandas as pd
value_list = ['10', '20', '30']
data_frame = pd.DataFrame({
    'index': range(len(value_list)),
    'value': [int(x) for x in value_list]})
data_frame:
index value
0 0 10
1 1 20
2 2 30
Now you can easily change dataframe index and 'index' column:
data_frame.loc[:, 'index'] += 1
data_frame.index += 1
data_frame:
index value
1 1 10
2 2 20
3 3 30
Try:
new_header = df.iloc[0]  # grab the first row for the header
df = df[1:]              # take the data less the header row
df.columns = new_header  # set the header row as the df header
Or just slice your dataframe (given the df shown above, the header row is at position 0):
df = data_frame[1:]
df.columns = data_frame.iloc[0]  # set the column names
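Regarding the axis label the question says survives: after promoting a row to the header, the columns keep that row's old label as their name, which can be cleared explicitly. A minimal sketch:
import pandas as pd

df = pd.DataFrame([['index', 'value'], ['1', '10'], ['2', '20']])
new_header = df.iloc[0]
df = df[1:]
df.columns = new_header
df.columns.name = None  # drop the leftover axis label
print(df)
#   index value
# 1     1    10
# 2     2    20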

Grouping data from multiple columns in data frame into summary view

I have a data frame as below and would like to create the summary information shown. Can you please help with how this can be done in pandas?
Data-frame:
import pandas as pd

ds = pd.DataFrame(
    [{"id": "1", "owner": "A", "delivery": "1-Jan", "priority": "High", "exception": "No Bill"},
     {"id": "2", "owner": "A", "delivery": "2-Jan", "priority": "Medium", "exception": ""},
     {"id": "3", "owner": "B", "delivery": "1-Jan", "priority": "High", "exception": "No Bill"},
     {"id": "4", "owner": "B", "delivery": "1-Jan", "priority": "High", "exception": "No Bill"},
     {"id": "5", "owner": "C", "delivery": "1-Jan", "priority": "High", "exception": ""},
     {"id": "6", "owner": "C", "delivery": "2-Jan", "priority": "High", "exception": ""},
     {"id": "7", "owner": "C", "delivery": "", "priority": "High", "exception": ""}]
)
Result:
Use:
#crosstab and rename empty string column
df = pd.crosstab(ds['owner'], ds['delivery']).rename(columns={'':'No delivery Date'})
#change positions of columns - first one to last one
df = df[df.columns[1:].tolist() + df.columns[:1].tolist()]
#get counts by comparing and sum of True values
df['high_count'] = ds['priority'].eq('High').groupby(ds['owner']).sum().astype(int)
df['exception_count'] = ds['exception'].eq('No Bill').groupby(ds['owner']).sum().astype(int)
#convert id to string and join with ,
df['ids'] = ds['id'].astype(str).groupby(ds['owner']).agg(','.join)
#index to column
df = df.reset_index()
#remove the index name 'delivery'
df.columns.name = None
print (df)
owner 1-Jan 2-Jan No delivery Date high_count exception_count ids
0 A 1 1 0 1 1 1,2
1 B 2 0 0 2 2 3,4
2 C 1 1 1 3 0 5,6,7
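If only the count and id columns are needed, named aggregation is a compact alternative (a sketch against the same ds; the per-date delivery columns would still come from the crosstab):
summary = ds.groupby('owner').agg(
    high_count=('priority', lambda s: (s == 'High').sum()),
    exception_count=('exception', lambda s: (s == 'No Bill').sum()),
    ids=('id', ','.join),
).reset_index()
print(summary)
#   owner  high_count  exception_count    ids
# 0     A           1                1    1,2
# 1     B           2                2    3,4
# 2     C           3                0  5,6,7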

Sort a column within groups in Pandas

I am new to pandas. I'm trying to sort a column within each group. So far, I was able to group the first and second column values together and calculate the mean of the third column, but I am still struggling to sort the third column.
This is my input dataframe
This is my dataframe after applying the groupby and mean functions
I used the following line of code to group the input dataframe:
df_o = df.groupby(by=['Organization Group', 'Department']).agg({'Total Compensation': np.mean})
Please let me know how to sort the last column within each group in the 1st column using pandas.
It seems you need sort_values:
#for return df add parameter as_index=False
df_o = df.groupby(['Organization Group', 'Department'],
                  as_index=False)['Total Compensation'].mean()
df_o = df_o.sort_values(['Total Compensation', 'Organization Group'])
Sample:
df = pd.DataFrame({'Organization Group': ['a', 'b', 'a', 'a'],
                   'Department': ['d', 'f', 'a', 'a'],
                   'Total Compensation': [1, 8, 9, 1]})
print (df)
Department Organization Group Total Compensation
0 d a 1
1 f b 8
2 a a 9
3 a a 1
df_o = df.groupby(['Organization Group', 'Department'],
                  as_index=False)['Total Compensation'].mean()
print (df_o)
Organization Group Department Total Compensation
0 a a 5
1 a d 1
2 b f 8
df_o = df_o.sort_values(['Total Compensation','Organization Group'])
print (df_o)
Organization Group Department Total Compensation
1 a d 1
0 a a 5
2 b f 8
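If the goal is to sort the means within each Organization Group rather than globally, a variant is to put the group key first in the sort:
df_o = df_o.sort_values(['Organization Group', 'Total Compensation'])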
