I have a DataFrame in which column 1 contains string values and column 2 contains lists of string values.
I want to iterate through column 1 and concatenate each value with every item of the list in the same row, collecting the results in a new DataFrame.
Say, my input is
`dfd = {'TRAINSET':['101','102','103', '104'], 'unique':[['a1','x1','b2'],['a1','b3','b2'] ,['d3','g5','x2'],['x1','b2','a1']]}`
After the operation, my data should look like this:
`dfd2 = {'TRAINSET':['101a1','101x1','101b2', '102a1','102b3','102b2','103d3', '103g5','103x2','104x1','104b2', '104a1']}`
What I tried is:
dg = pd.concat([g['TRAINSET'].map(g['unique']).apply(pd.Series)], axis = 1)
but I get KeyError: 'TRAINSET', so this is probably not the proper syntax.
Also, I would like to remove the NaN values in the lists.
One option is a list comprehension that flattens the lists, joins each TRAINSET value with each list item using +, and passes the result to the DataFrame constructor:
#if necessary (for example, when TRAINSET is the index rather than a column)
#df = df.reset_index()
#flatten the lists and filter out missing values
L = [(str(a) + x) for a, b in df[['TRAINSET','unique']].values for x in b if pd.notna(x)]
df1 = pd.DataFrame({'TRAINSET': L})
print (df1)
TRAINSET
0 101a1
1 101x1
2 101b2
3 102a1
4 102b3
5 102b2
6 103d3
7 103g5
8 103x2
9 104x1
10 104b2
11 104a1
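One possible cause of the KeyError in the question is that TRAINSET is the index rather than a regular column. That is an assumption, since the question does not show how the frame was built. A minimal sketch of that situation and of the reset_index fix, using two rows of the sample data:

```python
import pandas as pd

# Assumption: TRAINSET ended up as the index, which the question does not confirm
df = pd.DataFrame({
    "TRAINSET": ["101", "102"],
    "unique": [["a1", "x1", "b2"], ["a1", "b3", "b2"]],
}).set_index("TRAINSET")

# Column-style access fails while TRAINSET is the index
try:
    df["TRAINSET"]
    raised = False
except KeyError:
    raised = True

# Moving the index back into a column makes the list comprehension work
df = df.reset_index()
flat = [str(a) + x
        for a, b in zip(df["TRAINSET"], df["unique"])
        for x in b if pd.notna(x)]
print(raised, flat)
```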
Or use DataFrame.explode (pandas 0.25+), create a default index, remove missing values with DataFrame.dropna, join the columns with +, and use Series.to_frame to get a one-column DataFrame:
df = df.explode('unique').dropna(subset=['unique']).reset_index(drop=True)
df1 = (df['TRAINSET'].astype(str) + df['unique']).to_frame('TRAINSET')
print (df1)
TRAINSET
0 101a1
1 101x1
2 101b2
3 102a1
4 102b3
5 102b2
6 103d3
7 103g5
8 103x2
9 104x1
10 104b2
11 104a1
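The sample data contains no missing values, so the effect of dropna is not visible in the output above. A small sketch with a made-up NaN inside one list (the two rows below are illustrative, not from the question) shows the missing item being dropped before the concatenation:

```python
import numpy as np
import pandas as pd

# Illustrative data: one list contains a NaN that should not appear in the result
df = pd.DataFrame({
    "TRAINSET": ["101", "102"],
    "unique": [["a1", np.nan, "b2"], ["x1"]],
})

# One row per list item, then drop the rows whose item is missing
exploded = df.explode("unique").dropna(subset=["unique"])

# Concatenate the two string columns and rebuild a default index
result = (exploded["TRAINSET"] + exploded["unique"]).reset_index(drop=True).to_frame("TRAINSET")
print(result)
```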
Starting from your original data, you can do the following using explode (new in pandas 0.25) and agg:
Input:
dfd = {'TRAINSET':['101','102','103', '104'],
'unique':[['a1','x1','b2'],['a1','b3','b2'] ,['d3','g5','x2'],['x1','b2','a1']]}
Solution:
df = pd.DataFrame(dfd)
df.explode('unique').astype(str).agg(''.join,1).to_frame('TRAINSET').to_dict('list')
{'TRAINSET': ['101a1',
'101x1',
'101b2',
'102a1',
'102b3',
'102b2',
'103d3',
'103g5',
'103x2',
'104x1',
'104b2',
'104a1']}
Another solution, just to give you some choice...
import pandas as pd
_dfd = {'TRAINSET':['101','102','103', '104'], 'unique':[['a1','x1','b2'],['a1','b3','b2'] ,['d3','g5','x2'],['x1','b2','a1']]}
dfd = pd.DataFrame.from_dict(_dfd)
dfd.set_index("TRAINSET", inplace=True)
print(dfd)
dfd2 = dfd.reset_index()
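def refactor(row):
    # keep the list as a list: str(row["unique"]) would make the loop run over single characters
    key, items = str(row["TRAINSET"]), row["unique"]
    return [key + i for i in items]
dfd2['TRAINSET'] = dfd2.apply(refactor, axis=1)
dfd2.drop("unique", inplace=True, axis=1)
# one row per concatenated value
dfd2 = dfd2.explode("TRAINSET").reset_index(drop=True)
print(dfd2)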
180762508,1268510763,374723980,293,20180402035748,198,25,1,1
180762508,1268503685,374717256,307,20180402035758,225,38,1,1
180762508,1268492506,374708540,236,20180402035808,222,52,1,1
180762508,1268485868,374697563,248,20180402035818,197,47,1,1
180762508,1268482430,374688520,272,20180402035828,196,31,1,1
180707764,1270608366,374988433,246,20180402035925,66,37,1,0
180707764,1270620899,374992366,222,20180402035935,68,49,1,0
The first column is an ID and the last column is the one I'm interested in.
I want to find where the last column changes from 0 to 1.
I built a really big DataFrame from this dataset in pandas:
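import glob
import pandas as pd
path = r"1\1"
# raw string, so the backslash before * is not read as an escape sequence
allFiles = glob.glob(path + r"\*.DAT")
# avoid naming the variable "list", which would shadow the built-in
frames = []
for filename in allFiles:
    df = pd.read_csv(filename, header=None)
    frames.append(df)
a = pd.concat(frames)
a.head()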
This is all I have done so far.
I don't get an error, but I want to know an algorithm for finding where the last column's value changes within each ID.
My goal is to build a DataFrame in which
the first column is the ID, the second and third columns are the latitude and longitude (the third and second columns of my dataset), and the fourth is the timestamp (the 5th column of my dataset) at which the last column's value changes from 0 to 1.
If I understood you correctly, you need the 5th row, where the value in the last column changes (in your sample it goes from 1 to 0).
I made a DataFrame from your first and last columns (by the way, you said the first column is some kind of unique ID, but I see repeated numbers). Based on your sample data, one possible solution is:
import pandas as pd
data = [[180762508,1],[180762508,1],[180762508,1],[180762508,1],[180762508,1],[180707764,0],[180707764,0]]
df = pd.DataFrame(data,columns=['my_id','interest'])
#keep the rows whose interest value differs from the next row's
df2 = df.loc[df['interest'] != df['interest'].shift(-1)]
#output:
# my_id interest
# 4 180762508 1
# 6 180707764 0
imin = df2.index.min() #index of the last row before the change
i = imin + 1 #index of the first row after the change
#the row with the change in the original dataframe
print(df.loc[i])
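The comparison above looks at neighbouring rows regardless of ID, so the boundary between two IDs can be mistaken for a change. A hedged sketch of a per-ID variant, using groupby with shift so that each row is compared only with the previous row of the same ID. The small frame below is made up for illustration, because the posted sample contains no 0 to 1 transition:

```python
import pandas as pd

# Made-up sample: id 2 switches from 0 to 1, id 1 stays at 1
df = pd.DataFrame({
    "my_id": [1, 1, 2, 2, 2],
    "interest": [1, 1, 0, 0, 1],
})

# Previous interest value within the same id (NaN for the first row of each id)
prev = df.groupby("my_id")["interest"].shift()

# Rows where the value goes from 0 to 1 inside one id
changed = df[(prev == 0) & (df["interest"] == 1)]
print(changed)
```

Because the first row of every ID has no previous value, the switch from ID 1 (ending in 1) to ID 2 (starting at 0) is not reported.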
Hi, and thanks for posting. It looks like the first column doesn't have unique values, so I'm guessing you want either the index or the timestamp returned?
In any case, here's a sample of what might work for you if you want to find when the interest column for an ID changes from 0 to 1:
import pandas as pd
# Provided data
raw_str = """
180762508,1268510763,374723980,293,20180402035748,198,25,1,1 180762508,1268503685,374717256,307,20180402035758,225,38,1,1 180762508,1268492506,374708540,236,20180402035808,222,52,1,1 180762508,1268485868,374697563,248,20180402035818,197,47,1,1 180762508,1268482430,374688520,272,20180402035828,196,31,1,1 180707764,1270608366,374988433,246,20180402035925,66,37,1,0 180707764,1270620899,374992366,222,20180402035935,68,49,1,0
"""
# Replace newline and split on single whitespace
chunks = raw_str.replace('\n', '').split(' ')
# Create simple dictionary for ID, timestamp, and interest columns
ddict = {}
ddict['id'] = [i.split(',')[0] for i in chunks]
ddict['timestamp'] = [i.split(',')[4] for i in chunks]
ddict['interest'] = [i.split(',')[-1] for i in chunks]
# Convert dictionary to pandas DataFrame
df = pd.DataFrame(ddict)
# Create dictionary for sample data
# This is an existing ID with timestamp in the future and 1 as interest
tdict = {
'id': '180707764',
'timestamp': '20180402035945',
'interest': '1',
}
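What df looks like once the sample row has been appended and the rows sorted (the code for that step follows the table):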
id timestamp interest
0 180707764 20180402035925 0
1 180707764 20180402035935 0
2 180707764 20180402035945 1
3 180762508 20180402035748 1
4 180762508 20180402035758 1
5 180762508 20180402035808 1
6 180762508 20180402035818 1
7 180762508 20180402035828 1
Continuing on:
# Append that dictionary to your dataframe and sort by id, timestamp
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used here)
df = pd.concat([df, pd.DataFrame([tdict])], ignore_index=True)
df = df.sort_values(['id', 'timestamp']).reset_index(drop=True)
# Shift dataframe back 1 period by rows
df2 = df.shift(periods=-1, axis=0)
# Merge that dataframe with our original dataframe by index values
# We're dropping an extra id column and renaming our primary id column for aesthetics
df3 = df.merge(df2, left_index=True, right_index=True, suffixes=('_prev', '_curr'))
df3 = df3.drop('id_curr', axis=1).rename(columns={'id_prev': 'id'})
What df3 looks like:
id timestamp_prev interest_prev timestamp_curr interest_curr
0 180707764 20180402035925 0 20180402035935 0
1 180707764 20180402035935 0 20180402035945 1
2 180707764 20180402035945 1 20180402035748 1
3 180762508 20180402035748 1 20180402035758 1
4 180762508 20180402035758 1 20180402035808 1
5 180762508 20180402035808 1 20180402035818 1
6 180762508 20180402035818 1 20180402035828 1
7 180762508 20180402035828 1 NaN NaN
Now we can just create a conditional to return the row where interest changed from 0 to 1:
In[0]: df3[(df3['interest_prev'] == '0') & (df3['interest_curr'] == '1')]
Which returns:
id timestamp_prev interest_prev timestamp_curr interest_curr
1 180707764 20180402035935 0 20180402035945 1
You can also return specific columns by adding those onto the end of the result set:
df3[(df3['interest_prev'] == '0') & (df3['interest_curr'] == '1')]['timestamp_curr']
df3[(df3['interest_prev'] == '0') & (df3['interest_curr'] == '1')][['id', 'timestamp_curr']]
Or use the original dataframe (df) and .iloc to get specified data:
df.iloc[df3[(df3['interest_prev'] == '0') & (df3['interest_curr'] == '1')].index, :]
Out:
id timestamp interest
1 180707764 20180402035935 0
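To get closer to the frame described in the question (ID, latitude, longitude, timestamp at the 0 to 1 change), here is a rough sketch that works on the full nine-column rows. The output column names are my own, and the last data row is hypothetical: it reuses the made-up timestamp 20180402035945 from above and copies the coordinates of the preceding row, since the posted sample has no 0 to 1 change.

```python
import io
import pandas as pd

# Four rows from the question plus one hypothetical final row (last value 1)
raw = """\
180762508,1268510763,374723980,293,20180402035748,198,25,1,1
180762508,1268503685,374717256,307,20180402035758,225,38,1,1
180707764,1270608366,374988433,246,20180402035925,66,37,1,0
180707764,1270620899,374992366,222,20180402035935,68,49,1,0
180707764,1270620899,374992366,222,20180402035945,68,49,1,1
"""
a = pd.read_csv(io.StringIO(raw), header=None)

# Order by id (column 0) and timestamp (column 4) so shift compares consecutive records
a = a.sort_values([0, 4])

# Previous value of the last column (8) within the same id
prev_flag = a.groupby(0)[8].shift()

# Keep id, latitude (column 2), longitude (column 1) and timestamp at the 0 -> 1 change
hits = a.loc[(prev_flag == 0) & (a[8] == 1), [0, 2, 1, 4]]
hits.columns = ["id", "latitude", "longitude", "timestamp"]
print(hits)
```

Grouping by ID before shifting avoids the row in df3 above where the last record of one ID is paired with the first record of the next.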
I have a pandas dataframe which looks like below:
print (df)
customerid acc_type amount premium_member
0 1 Savings 200 N
1 1 Current 300 Y
2 2 Savings 250 N
I want to transform it into the data frame below, which spreads acc_type and amount across two columns each (dropping the original ones).
A customer can have at most two rows in the original dataframe, and the account type is always Savings or Current (never any other value).
The premium_member attribute is computed by taking the logical OR of the boolean (Y and N) values.
Use:
import numpy as np
#keep only customers with at most 2 rows
df = df[df.groupby('customerid')['acc_type'].transform('size') < 3].copy()
#new column
df['is'] = 1
#reshape and replace missing values to 0
df1 = df.set_index(['customerid','acc_type']).unstack(fill_value=0)
#check if Y in premium_member
s = df1.pop('premium_member').eq('Y').any(axis=1)
#change order of columns
df1 = df1.sort_index(axis=1, ascending=False)
#flatten MultiIndex
df1.columns = df1.columns.map(''.join)
#new column
df1['premium_member'] = np.where(s, 'Y','N')
#convert index to column
df1 = df1.reset_index().rename_axis(None, axis=1)
print (df1)
customerid isSavings isCurrent amountSavings amountCurrent \
0 1 1 1 200 300
1 2 1 0 250 0
premium_member
0 Y
1 N
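For comparison, here is a self-contained sketch of an alternative that builds the three pieces separately and concatenates them. It swaps in pd.crosstab and pivot_table for the unstack approach above, and the intermediate names (order, flags, amounts, premium) are my own:

```python
import pandas as pd

df = pd.DataFrame({
    "customerid": [1, 1, 2],
    "acc_type": ["Savings", "Current", "Savings"],
    "amount": [200, 300, 250],
    "premium_member": ["N", "Y", "N"],
})
order = ["Savings", "Current"]

# 1/0 flag per account type
flags = (pd.crosstab(df["customerid"], df["acc_type"])
           .reindex(columns=order, fill_value=0)
           .clip(upper=1)
           .add_prefix("is"))

# Amount per account type, 0 where the customer has no such account
amounts = (df.pivot_table(index="customerid", columns="acc_type",
                          values="amount", aggfunc="sum", fill_value=0)
             .reindex(columns=order, fill_value=0)
             .astype(int)
             .add_prefix("amount"))

# Logical OR of the Y/N values per customer
premium = (df["premium_member"].eq("Y")
             .groupby(df["customerid"]).any()
             .map({True: "Y", False: "N"})
             .rename("premium_member"))

out = pd.concat([flags, amounts, premium], axis=1).rename_axis("customerid").reset_index()
out.columns.name = None
print(out)
```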
I have a data frame as below and would like to create the summary information shown. Could you please help me with how this can be done in pandas?
Data-frame:
import pandas as pd
ds = pd.DataFrame(
[{"id":"1","owner":"A","delivery":"1-Jan","priority":"High","exception":"No Bill"},{"id":"2","owner":"A","delivery":"2-Jan","priority":"Medium","exception":""},{"id":"3","owner":"B","delivery":"1-Jan","priority":"High","exception":"No Bill"},{"id":"4","owner":"B","delivery":"1-Jan","priority":"High","exception":"No Bill"},{"id":"5","owner":"C","delivery":"1-Jan","priority":"High","exception":""},{"id":"6","owner":"C","delivery":"2-Jan","priority":"High","exception":""},{"id":"7","owner":"C","delivery":"","priority":"High","exception":""}]
)
Result:
Use:
#crosstab and rename empty string column
df = pd.crosstab(ds['owner'], ds['delivery']).rename(columns={'':'No delivery Date'})
#change positions of columns - first one to last one
df = df[df.columns[1:].tolist() + df.columns[:1].tolist()]
#get counts by comparing and sum of True values
df['high_count'] = ds['priority'].eq('High').groupby(ds['owner']).sum().astype(int)
df['exception_count'] = ds['exception'].eq('No Bill').groupby(ds['owner']).sum().astype(int)
#convert id to string and join with ,
df['ids'] = ds['id'].astype(str).groupby(ds['owner']).agg(','.join)
#index to column
df = df.reset_index()
#remove the columns name delivery
df.columns.name = None
print (df)
owner 1-Jan 2-Jan No delivery Date high_count exception_count ids
0 A 1 1 0 1 1 1,2
1 B 2 0 0 2 2 3,4
2 C 1 1 1 3 0 5,6,7
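The same summary can also be sketched with a single groupby and named aggregation (pandas 0.25+) for the count and id columns, joined onto the crosstab. This is an alternative arrangement rather than the method above, and the names counts, stats and summary are my own:

```python
import pandas as pd

ds = pd.DataFrame({
    "id": ["1", "2", "3", "4", "5", "6", "7"],
    "owner": ["A", "A", "B", "B", "C", "C", "C"],
    "delivery": ["1-Jan", "2-Jan", "1-Jan", "1-Jan", "1-Jan", "2-Jan", ""],
    "priority": ["High", "Medium", "High", "High", "High", "High", "High"],
    "exception": ["No Bill", "", "No Bill", "No Bill", "", "", ""],
})

# Deliveries per owner and date; the empty date is relabelled before counting
counts = pd.crosstab(ds["owner"], ds["delivery"].replace("", "No delivery Date"))

# Per-owner counts and the joined ids in one pass
stats = ds.groupby("owner").agg(
    high_count=("priority", lambda s: int((s == "High").sum())),
    exception_count=("exception", lambda s: int((s == "No Bill").sum())),
    ids=("id", lambda s: ",".join(s)),
)

summary = counts.join(stats).reset_index()
summary.columns.name = None
print(summary)
```

Relabelling the empty delivery value before the crosstab means the columns already sort into the desired order for this data, so no manual reordering is needed.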