Change the form of existing DataFrame - python

I want to reshape an existing DataFrame into a new DataFrame whose values reflect the relationship between the two existing columns: in the new DataFrame, "1" means the pair occurs in the existing DataFrame and "0" means it does not.
This is what I have done so far. I built the result by hand, but that won't work once I have more than 1000 rows.
Existing dataframe:
import numpy as np
import pandas as pd
series_1 = [[19,"a"],[20,"d"],[31,"d"],[31,"c"],[51,"d"]]
a_df = pd.DataFrame(series_1)
Desired dataframe:
cols = ["a","c","d"]
series_3 = [1,0,0,
            0,0,1,
            0,1,1,
            0,0,1]
np_series = np.array(series_3).reshape(4,3)
c_df = pd.DataFrame(np_series,index = [19,20,31,51],columns=cols)
I'm wondering what are some good ways to transform the dataframe according to above request. Thank you!

try this:
pd.crosstab(a_df[0], a_df[1])
Result:
1   a  c  d
0
19  1  0  0
20  0  0  1
31  0  1  1
51  0  0  1
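One caveat with the crosstab approach: each cell is a count, so a repeated pair shows up as 2 or more rather than 1. A minimal sketch of capping the counts at 1, assuming the question's data plus one hypothetical duplicate row (31, "d") added purely for illustration:

```python
import pandas as pd

# Question's data plus one hypothetical duplicate pair (31, "d")
a_df = pd.DataFrame(
    [[19, "a"], [20, "d"], [31, "d"], [31, "c"], [51, "d"], [31, "d"]]
)

counts = pd.crosstab(a_df[0], a_df[1])  # each cell = number of matching rows
indicator = counts.clip(upper=1)        # each cell = 1 if any row matches, else 0

print(counts.loc[31, "d"], indicator.loc[31, "d"])
```

With the original five rows there are no duplicates, so the clip step changes nothing.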

Quick Answer to your Question
import pandas as pd
dic = {
    '0': [19,20,31,31,51],
    '1': ['a','d','d','c','d']
}
df = pd.DataFrame(dic) # Creating a dataframe
unique_vals = df['1'].unique().tolist() # Finding unique values in the desired column
for val in unique_vals:
    df[val] = (df['1'] == val).astype(int) # 1 where the row matches val, else 0
df.set_index('0', inplace = True) # Setting index
df.drop(['1'],axis = 1, inplace =True) #! Only use this line if you want to delete the '1' column
print(df)
Output
0   a  d  c
19  1  0  0
20  0  1  0
31  0  1  0
31  0  0  1
51  0  1  0
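Note that the output above keeps one row per original record, so the key 31 appears twice, whereas the desired DataFrame has a single row per key. A hedged sketch of one way to collapse the duplicates, using pd.get_dummies in place of the explicit loop and taking the per-key maximum (the variable names dummies and collapsed are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"0": [19, 20, 31, 31, 51], "1": ["a", "d", "d", "c", "d"]})

# One 0/1 column per distinct value in column '1'
dummies = pd.get_dummies(df["1"]).astype(int)

# Collapse rows that share the same key: a cell is 1 if any of them had a 1
collapsed = dummies.groupby(df["0"]).max()

print(collapsed)
```

Here the row for 31 becomes 0, 1, 1 for the columns a, c, d, which matches the desired DataFrame in the question.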

Related

Python pandas create new column with string code based on boolean rows

I have a DataFrame with multiple columns containing booleans/ints (1/0). I need a new result column of strings that encode, for each row, how many runs of consecutive Trues there are, whether the chain is interrupted, and from which column to which column each run extends.
For example this is the following dataframe:
   column_1  column_2  column_3  column_4  column_5  column_6  column_7  column_8  column_9  column_10
0         0         1         0         1         1         1         1         0         0          1
1         0         1         1         0         1         1         1         0         0          1
2         1         1         0         0         0         1         1         0         0          1
3         1         1         1         0         0         0         0         1         1          1
4         1         1         1         0         0         1         0         0         1          1
5         1         1         1         0         0         0         1         1         0          1
6         0         1         1         1         1         1         1         0         1          0
Take the following row, for example: 1: [0 1 1 0 1 1 1 0 0 1]
It would produce the code string i2/2-3/c2-c3_c5-c7/6 in column_result, which is built from four segments that I can parse later in my code.
Segment 1:
'i' stands for interrupted; if the chain is not interrupted, it is 'c' for consecutive.
2 is the number of times 2 or more consecutive Trues were found.
Segment 2:
The length of each consecutive group; in this case the first group has length 2 and the second has length 3.
Segment 3:
For each consecutive group, the number/id of the column where its first True was found and of the column where its last True was found.
Segment 4:
The total count of Trues in the row.
Another example is the following row: 6: [0 1 1 1 1 1 1 0 1 0]
It would produce the code string c1/6/c2-c7/7 in column_result.
The code below is the starter code I used to create the above DataFrame, with random ints standing in for bools:
import numpy as np
import pandas as pd

def create_custom_result(df: pd.DataFrame) -> pd.Series:
    return df  # placeholder: this is the part I need help with

def create_dataframe() -> pd.DataFrame:
    df = pd.DataFrame()  # empty df
    for i in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:  # create random bool/int values
        df[f'column_{i}'] = np.random.randint(2, size=50)
    df["column_result"] = ''  # add result column
    return df

if __name__=="__main__":
    df = create_dataframe()
    custom_results = create_custom_result(df=df)
Does anyone have an idea of how to tackle this? To be honest, I have no idea where to start. The closest thing I found was: count sets of consecutive true values in a column. However, it works down a column rather than horizontally across a row. Should I try NumPy arrays, or does pandas have a function that can help? I found some groupby functions that work horizontally, but I wouldn't know how to convert their output into the string code for the result column. Or should I loop through the DataFrame by rows and build the column_result code segment by segment?
Thanks in advance!
I have already tried a few things, such as looping through the DataFrame row by row, but I had no idea how to build a new column with the code strings.
I also found an article on pandas groupby, but I wouldn't know how to create new string data for a column from the groups I found. Also, almost everything I find groups within a single column rather than across the rows of all columns.
This code may work:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,2, size=(12,8)))
df.columns=["col1","col2","col3","col4","col5","col6","col7","col8"]
def func(df:pd.DataFrame) -> pd.DataFrame:
    result_list = []
    copy = df.copy()
    cumsum = copy.cumsum(axis=1)  # running count of Trues along each row
    for r,s in cumsum.iterrows():
        count = 0
        last = -1
        interrupted = 0
        consecutive = 0
        consecutives = []
        ranges = []
        for x in s.values:
            count += 1
            if x != 0:
                if x!=last:  # running count increased: this cell is True
                    consecutive += 1
                    last = x
                    if consecutive == 2:
                        ranges.append(count-1)  # start column of the run
                elif x==last:  # running count unchanged: this cell is False
                    if consecutive > 1:
                        interrupted += 1
                        ranges.append(count-1)  # end column of the run
                        consecutives.append(str(consecutive))
                    consecutive = 0
        else:  # after the loop: close a run that reaches the last column
            if consecutive > 1:
                consecutives.append(str(consecutive))
                ranges.append(count)
        result = f'{interrupted}i/{len(consecutives)}c/{"-".join(consecutives)}/{"_".join([ f"c{ranges[i]}-c{ranges[i+1]}" for i in range(0,len(ranges),2) ])}/{last}'
        result_list.append(result.split("/"))
    copy["results"] = pd.Series(["/".join(i) for i in result_list])
    copy[["interrupts_count","consecutives_count","consecutives lengths","consecutives columns ranges","total"]] = pd.DataFrame(np.array(result_list))
    return copy
result_df = func(df)
Another option is a simple class per column that receives the series from the original DataFrame (i.e. a vertical slice) together with each new value. From the vertical slice, calculate the starting state as fields (start of the consecutive True values, length of the consecutive True values, last value, ...). Then, for each new value, update those fields and prepare the string output from them.
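The four-segment code described in the question can also be sketched row by row with itertools.groupby. This is only an illustration under stated assumptions: the helper name encode_row is hypothetical, and the 'i'/'c' prefix is interpreted as 'i' when a row has more than one run of two or more Trues, which is an inference from the two worked examples rather than something the question states outright.

```python
from itertools import groupby

import pandas as pd


def encode_row(values):
    """Build the code string for one row of 0/1 values (hypothetical helper)."""
    runs = []  # (start, end) 1-based column positions of runs of length >= 2
    pos = 0
    for is_true, group in groupby(values, key=bool):
        length = len(list(group))
        if is_true and length >= 2:
            runs.append((pos + 1, pos + length))
        pos += length

    # Assumption: "interrupted" means more than one qualifying run
    prefix = "i" if len(runs) > 1 else "c"
    lengths = "-".join(str(end - start + 1) for start, end in runs)
    spans = "_".join(f"c{start}-c{end}" for start, end in runs)
    total = sum(bool(v) for v in values)
    return f"{prefix}{len(runs)}/{lengths}/{spans}/{total}"


# The two example rows from the question
rows = pd.DataFrame(
    [[0, 1, 1, 0, 1, 1, 1, 0, 0, 1], [0, 1, 1, 1, 1, 1, 1, 0, 1, 0]],
    columns=[f"column_{i}" for i in range(1, 11)],
)
rows["column_result"] = rows.apply(lambda r: encode_row(r.tolist()), axis=1)
print(rows["column_result"].tolist())
# ['i2/2-3/c2-c3_c5-c7/6', 'c1/6/c2-c7/7']
```

A row-wise apply is not vectorized, so for very large frames a NumPy-based approach may be faster; for 50 rows by 10 columns, as in the starter code, that should not matter.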

How to create a new column in a Pandas DataFrame based on a column in another DataFrame?

*I am new to Python and Pandas
I need to do the following
I have 2 DataFrames, lets call them df1 and df2
df1
Index  Req ID  City_Atlanta  City_Seattle  City_Boston  Result
0      X       1             0             0            0
1      Y       0             1             0            0
2      Z       0             0             1            1
df2
Index  Req_ID  City
0      X       Atlanta
1      Y       Seattle
2      Z       Boston
I want to add a column in df2 called result such that df2.result = False if df1.result = 0 and df2.result = True if df1.result = 1
The final result should look like
df2
Index  Req_ID  City     result
0      X       Atlanta  False
1      Y       Seattle  False
2      Z       Boston   True
I am new to asking question on Stack Overflow as well so pardon any common mistakes.
If Req ID is the matching key and the DataFrames differ in length, you can use:
df2['Result'] = df2.Req_ID.map(dict(zip(df1['Req ID'],df1.Result))).astype(bool)
0    False
1    False
2     True
If the lengths are equal, you can use the solution by @aws_apprentice.
You can apply bool to the 0/1 values.
df2['Result'] = df1['Result'].apply(bool)
You can also map a dictionary of values.
df2['Result'] = df1['Result'].map({0: False, 1: True})
Assuming they're the same lengths you can do:
df2['Result'] = df1['Result']==1
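The positional answers above rely on df1 and df2 having the same rows in the same order. A minimal sketch of the key-based alternative, assuming a hypothetical extra request "W" in df2 that has no match in df1, to show what happens to unmatched keys:

```python
import pandas as pd

df1 = pd.DataFrame({"Req ID": ["X", "Y", "Z"], "Result": [0, 0, 1]})
df2 = pd.DataFrame(
    {
        "Req_ID": ["X", "Y", "Z", "W"],  # "W" is hypothetical, absent from df1
        "City": ["Atlanta", "Seattle", "Boston", "Denver"],
    }
)

# Boolean lookup keyed by request id
lookup = df1.set_index("Req ID")["Result"].astype(bool)

# Matched ids get True/False; unmatched ids stay missing (NaN)
df2["result"] = df2["Req_ID"].map(lookup)
print(df2)
```

One design point worth noting: calling .astype(bool) after the map would turn the missing value for "W" into True, because NaN is truthy. Converting to bool before mapping, as here, keeps unmatched rows visibly missing.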

python pandas group by and aggregate columns

I am using pandas version 0.23.0. I want to use the DataFrame groupby function to generate new aggregated columns using lambda functions.
My data frame looks like
ID  Flag  Amount  User
1   1     100     123345
1   1     55      123346
2   0     20      123346
2   0     30      123347
3   0     50      123348
I want to generate a table which looks like
ID  Flag0_Count  Flag1_Count  Flag0_Amount_SUM  Flag1_Amount_SUM  Flag0_User_Count  Flag1_User_Count
1   0            2            0                 155               0                 2
2   2            0            50                0                 2                 0
3   1            0            50                0                 1                 0
here:
Flag0_Count is count of Flag = 0
Flag1_Count is count of Flag = 1
Flag0_Amount_SUM is the sum of Amount when Flag = 0
Flag1_Amount_SUM is the sum of Amount when Flag = 1
Flag0_User_Count is Count of Distinct User when Flag = 0
Flag1_User_Count is Count of Distinct User when Flag = 1
I have tried something like
df.groupby(["ID"])["Flag"].apply(lambda x: sum(x==0)).reset_index()
but it creates a new DataFrame. This means I will have to do this for every column and then merge the results together into a new DataFrame.
Is there an easier way to accomplish this?
Use DataFrameGroupBy.agg with a dictionary that maps column names to aggregate functions, then reshape with unstack, flatten the MultiIndex in the columns, rename the columns, and finally call reset_index:
df = (df.groupby(["ID", "Flag"])
        .agg({'Flag':'size', 'Amount':'sum', 'User':'nunique'})
        .unstack(fill_value=0))
# Python 3.6+
df.columns = [f'{i}{j}' for i, j in df.columns]
# older Python versions
#df.columns = ['{}{}'.format(i, j) for i, j in df.columns]
d = {'Flag0':'Flag0_Count',
     'Flag1':'Flag1_Count',
     'Amount0':'Flag0_Amount_SUM',
     'Amount1':'Flag1_Amount_SUM',
     'User0':'Flag0_User_Count',
     'User1':'Flag1_User_Count',
}
df = df.rename(columns=d).reset_index()
print (df)
print (df)
   ID  Flag0_Count  Flag1_Count  Flag0_Amount_SUM  Flag1_Amount_SUM  \
0   1            0            2                 0               155
1   2            2            0                50                 0
2   3            1            0                50                 0

   Flag0_User_Count  Flag1_User_Count
0                 0                 2
1                 2                 0
2                 1                 0
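On newer pandas releases the rename dictionary can be avoided. The following is a sketch, not the answer's method: it assumes pandas 0.25 or later (newer than the 0.23.0 mentioned in the question), where named aggregation is available, and it counts rows via the Amount column so that the grouping column Flag is not referenced inside agg.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ID": [1, 1, 2, 2, 3],
        "Flag": [1, 1, 0, 0, 0],
        "Amount": [100, 55, 20, 30, 50],
        "User": [123345, 123346, 123346, 123347, 123348],
    }
)

wide = (
    df.groupby(["ID", "Flag"])
    .agg(
        Count=("Amount", "size"),        # rows per (ID, Flag)
        Amount_SUM=("Amount", "sum"),    # total Amount per (ID, Flag)
        User_Count=("User", "nunique"),  # distinct users per (ID, Flag)
    )
    .unstack(fill_value=0)               # Flag values become a column level
)

# Flatten (name, flag) pairs into names such as Flag0_Count
wide.columns = [f"Flag{flag}_{name}" for name, flag in wide.columns]
wide = wide.reset_index()
print(wide)
```

The output names are chosen directly in the agg call, so the target column names from the question are produced by the flattening step alone.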

How to create pandas dummies based on column values

I would like to create dummies based on column values...
This is what the df looks like
I want to create this
This is so far my approach
import pandas as pd
df = pd.read_csv('test.csv')
v = df.Values
v_set = set()
for line in v:
    line = line.split(',')
    for x in line:
        if x != "":
            v_set.add(x)  # collect every distinct non-empty token
        else:
            continue
for val in v_set:
    df[val] = ''  # create one empty column per token
By the above code I am able to create columns in my df like this
How do I go about updating the row values to create dummies?
This is where I am having problems.
Thanks in advance.
You could use pandas.Series.str.get_dummies. This will allow you to split the column directly with a delimiter.
df = pd.concat([df.ID, df.Values.str.get_dummies(sep=",")], axis=1)
   ID  1  2  3  4
0   1  1  1  0  0
1   2  0  0  1  1
df.Values.str.get_dummies(sep=",") will generate
   1  2  3  4
0  1  1  0  0
1  0  0  1  1
Then, we use pd.concat to glue the ID column and the dummy columns together.
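For completeness, the loop-based approach from the question can also be finished by filling each new column with a membership test instead of an empty string. A minimal sketch, assuming a hypothetical two-row stand-in for test.csv (the real file is not shown in the question) whose contents are inferred from the output above:

```python
import pandas as pd

# Hypothetical stand-in for test.csv
df = pd.DataFrame({"ID": [1, 2], "Values": ["1,2", "3,4"]})

tokens = df["Values"].str.split(",")  # list of tokens per row
labels = sorted({t for row in tokens for t in row if t != ""})

for label in labels:
    # 1 if the row's token list contains this label, else 0
    df[label] = tokens.apply(lambda row, label=label: int(label in row))

print(df.drop(columns="Values"))
```

This produces the same 0/1 columns as str.get_dummies(sep=",") for this data; the built-in is shorter and is usually the better choice.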

Add column to pandas without headers

How does one append a column of constant values to a pandas dataframe without headers? I want to append the column at the end.
With headers I can do it this way:
df['new'] = pd.Series([0 for x in range(len(df.index))], index=df.index)
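Every non-empty DataFrame has columns, an index and some values, even when no header was supplied; the columns then default to the integers 0, 1, 2, ...
So you can use the next integer as the column label and create a new column filled with a scalar: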
df[len(df.columns)] = 0
Sample:
df = pd.DataFrame({0:[1,2,3],
                   1:[4,5,6]})
print (df)
   0  1
0  1  4
1  2  5
2  3  6
df[len(df.columns)] = 0
print (df)
   0  1  2
0  1  4  0
1  2  5  0
2  3  6  0
Also, the simplest way to create a new column with a name is:
df['new'] = 1
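Putting this together for a headerless file, here is a small sketch. It assumes the data arrives as CSV text without a header row; the in-memory string stands in for a real file and is purely illustrative.

```python
import io

import pandas as pd

raw = io.StringIO("1,4\n2,5\n3,6\n")  # illustrative headerless CSV content

# header=None makes pandas label the columns 0, 1, ...
df = pd.read_csv(raw, header=None)

# Append a constant column under the next integer label
df[df.shape[1]] = 0

# Write back out without header or index so the file stays headerless
out = df.to_csv(index=False, header=False)
print(out)
```

Using df.shape[1] as the label assumes the existing labels are the default 0..n-1; with other labels, any unused name works just as well.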
