Creating new columns based on another column in pandas


I'm doing some analysis on the dataframe below:
timestamp conversationId UserId MessageId tpMessage Message
1614578324 ceb9004ae9d3 1c376ef 5bbd34859329 question Where do you live?
1614578881 ceb9004ae9d3 1c376ef d3b5d3884152 answer Brooklyn
1614583764 ceb9004ae9d3 1c376ef 0e4501fcd61f question What's your name?
1614590885 ceb9004ae9d3 1c376ef 97d841b79ff7 answer Phill
1614594952 ceb9004ae9d3 1c376ef 11ed3fd24767 question What's your gender?
1614602036 ceb9004ae9d3 1c376ef 601538860004 answer Male
1614602581 ceb9004ae9d3 1c376ef 8bc8d9089609 question How old are you?
1614606219 ceb9004ae9d3 1c376ef a2bd45e64b7c answer 35
1614606240 loi90zj8q0qv 1c890r9 o2bd10ex4b8u question Where do you live?
1614606240 jto9034pe0i5 1c489rl o6bd35e64b5j question What's your name?
1614606250 jto9034pe0i5 1c489rl 96jd89i55b72 answer Robert
1614606267 jto9034pe0i5 1c489rl 33yd1445d6ut answer Brandom
1614606267 loi90zj8q0qv 1c890r9 o2bd10ex4b8u answer London
1614606287 jto9034pe0i5 1c489rl b7q489iae77t answer Connor
I need to "split" the timestamp column in two based on the tpMessage column. The conditions are:
df['ts_question'] = np.where(df['tpMessage']=='question', df['timestamp'],0)
df['ts_answer'] = np.where(df['tpMessage']=='answer', df['timestamp'],0)
This gives me 0 in both columns wherever the condition doesn't match, and I'm stuck on how to move forward from there.
My goal is to get this output:
ts_question ts_answer conversationId UserId
1614578324 1614578881 ceb9004ae9d3 1c376ef
1614583764 1614590885 ceb9004ae9d3 1c376ef
1614594952 1614602036 ceb9004ae9d3 1c376ef
1614602581 1614606219 ceb9004ae9d3 1c376ef
1614606240 1614606250 jto9034pe0i5 1c489rl
1614606240 1614606267 o2bd10ex4b8u 1c890r9
1614606240 1614606267 o2bd10ex4b8u 1c489rl
1614606240 1614606287 jto9034pe0i5 1c489rl
Note that I can have one or more answers to the question "What's your name?".
Edit: I found out that I can have N conversations happening at the same timestamp (e.g. 1614606240 and 1614606267).
Could you help me with that?

You can use merge:
# Assuming the dataframe is already sorted by timestamp
df['thread'] = df['tpMessage'].eq('question').cumsum()
# Split your data in two new dataframes: questions and answers
dfq = df[df['tpMessage'] == 'question'].rename(columns={'timestamp': 'ts_question'})
dfa = df[df['tpMessage'] == 'answer'].rename(columns={'timestamp': 'ts_answer'})
# Merge them on conversation, user id and thread
cols = ['ts_question', 'ts_answer', 'conversationId', 'UserId']
out = dfa.merge(dfq, on=['conversationId', 'UserId', 'thread'], how='outer')[cols]
Output:
>>> out
ts_question ts_answer conversationId UserId
0 1614578324 1614578881 ceb9004ae9d3 1c376ef
1 1614583764 1614590885 ceb9004ae9d3 1c376ef
2 1614594952 1614602036 ceb9004ae9d3 1c376ef
3 1614602581 1614606219 ceb9004ae9d3 1c376ef
4 1614606240 1614606250 jto9034pe0i5 1c489rl
5 1614606240 1614606267 jto9034pe0i5 1c489rl
6 1614606240 1614606287 jto9034pe0i5 1c489rl
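When conversations interleave, as in the question's edit, a single running count over the whole dataframe can put a question and its later answer in different threads. The sketch below is one possible variation, not the answer's own code: it numbers threads separately inside each (conversationId, UserId) pair. The sample rows are copied from the question; the inner join is an assumption that unanswered questions should be dropped.

```python
import pandas as pd

# Interleaved rows taken from the end of the question's sample data.
df = pd.DataFrame({
    "timestamp": [1614606240, 1614606240, 1614606250,
                  1614606267, 1614606267, 1614606287],
    "conversationId": ["loi90zj8q0qv", "jto9034pe0i5", "jto9034pe0i5",
                       "jto9034pe0i5", "loi90zj8q0qv", "jto9034pe0i5"],
    "UserId": ["1c890r9", "1c489rl", "1c489rl",
               "1c489rl", "1c890r9", "1c489rl"],
    "tpMessage": ["question", "question", "answer",
                  "answer", "answer", "answer"],
})

keys = ["conversationId", "UserId"]

# Count questions per conversation/user instead of over the whole frame.
is_question = df["tpMessage"].eq("question").astype(int)
df["thread"] = is_question.groupby([df[k] for k in keys]).cumsum()

cols = keys + ["thread", "timestamp"]
dfq = df.loc[df["tpMessage"].eq("question"), cols].rename(
    columns={"timestamp": "ts_question"})
dfa = df.loc[df["tpMessage"].eq("answer"), cols].rename(
    columns={"timestamp": "ts_answer"})

# Inner join (assumption): keep only questions that received an answer.
out = dfq.merge(dfa, on=keys + ["thread"])[["ts_question", "ts_answer"] + keys]
print(out)
```

This still assumes that, within one conversation, each question row appears before its answers.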

Related

how to search user entered value in a list? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 months ago.
(s213,bill,72,59,45)
(s214,john,88,70,80)
(s215,anne,52,61,44)
(s216,sliva,44,50,35)
The values in the brackets above are the contents of marks.txt.
The code below converts each line into a list:
with open('marks.txt', 'r') as file:
    marks = file.readlines()
for a in marks:
    x = a.split(',')
    print(x)
I want the user to enter a student number (e.g. s213) and have the program display that student's name and average.

You probably want to use a nested dictionary. Here, each key is a student number and each value is another dictionary that stores the student's name and scores.
Code:
students = {"s213": {"name": "bill", "scores": [72, 59, 45]},
            "s214": {"name": "john", "scores": [88, 70, 80]},
            "s215": {"name": "anne", "scores": [52, 61, 44]},
            "s216": {"name": "sliva", "scores": [44, 50, 35]}}
student_number = input("Please enter the student number: ")
print("Name: " + students[student_number]["name"])
grades = students[student_number]["scores"]
print("Average grade: " + str(sum(grades) / len(grades)))
Output:
Please enter the student number:
s214
Name: john
Average grade: 79.33333333333333
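Instead of typing the dictionary by hand, it could be built from the lines of marks.txt. The following is a minimal sketch, assuming every line looks like `(s213,bill,72,59,45)`; the helper name `parse_marks` is hypothetical, and the sample lines stand in for the file contents.

```python
def parse_marks(lines):
    """Build {student_number: {"name": ..., "scores": [...]}} from raw lines."""
    students = {}
    for line in lines:
        # Drop the newline and the surrounding parentheses, then split fields.
        fields = line.strip().strip("()").split(",")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        number, name, *scores = fields
        students[number] = {"name": name, "scores": [int(s) for s in scores]}
    return students


# Sample lines copied from the question; with a real file you could pass
# the open file object instead of this list.
lines = ["(s213,bill,72,59,45)", "(s214,john,88,70,80)",
         "(s215,anne,52,61,44)", "(s216,sliva,44,50,35)"]
students = parse_marks(lines)

record = students.get("s214")  # .get avoids a KeyError for unknown numbers
if record is not None:
    average = sum(record["scores"]) / len(record["scores"])
    print("Name:", record["name"])
    print("Average grade:", average)
```

Using `students.get(...)` lets the program report an unknown student number instead of crashing.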

Use pandas to load the CSV (which is what I'm assuming the format of your data is) and add an 'average' column for easier lookup.
tmp.csv:
student_id,name,grade1,grade2,grade3
s213,bill,72,59,45
s214,john,88,70,80
s215,anne,52,61,44
s216,sliva,44,50,35
Code:
import pandas as pd
# load csv
df = pd.read_csv('tmp.csv')
# average only the numeric (grade) columns of each row
df['average'] = df.mean(axis=1, numeric_only=True)
print(df)
search_name = input('enter student name to get an average for: ')
print(df.loc[df['name'] == search_name, 'average'])
df printed:
student_id name grade1 grade2 grade3 average
0 s213 bill 72 59 45 58.666667
1 s214 john 88 70 80 79.333333
2 s215 anne 52 61 44 52.333333
3 s216 sliva 44 50 35 43.000000
result from search
enter student name to get an average for: bill
0 58.666667
Name: average, dtype: float64
NOTE: the mean call used to get the averages works on every numeric element of the row. Strings can't be averaged: older pandas versions silently ignore them, while recent versions (2.0 and later) raise an error unless you pass numeric_only=True or select the grade columns first. If your data includes numeric columns that shouldn't be averaged, select the columns to average explicitly.

How to merge duplicate records to a single record in dataframe Python? [duplicate]

This question already has answers here:
Pandas groupby with delimiter join
(2 answers)
Closed 11 months ago.
I read a csv file into a dataframe as below:
ID      StudentCode  Name   Birth
Abc001  S-01         John   03/10/2000
Abc002  S-01         John   03/10/2000
Abc003  S-01         John   03/10/2000
Abc004  S-02         Mark   12/08/2001
Abc005  S-02         Mark   12/08/2001
Abc006  S-03         Ernst  01/10/2005
...     ...          ...    ...
I am trying to convert it to another dataframe like this:
StudentCode  Name   Birth       ID
S-01         John   03/10/2000  Abc001; Abc002; Abc003
S-02         Mark   12/08/2001  Abc004; Abc005
S-03         Ernst  01/10/2005  Abc006
...          ...    ...         ...
Are there dataframe methods in Python that can do a conversion like the one above?
Thanks,

import io
import pandas as pd
temp = io.StringIO("""ID,StudentCode,Name,Birth
0,Abc001,S-01,John,03/10/2000
1,Abc002,S-01,John,03/10/2000
2,Abc003,S-01,John,03/10/2000
3,Abc004,S-02,Mark,12/08/2001
4,Abc005,S-02,Mark,12/08/2001
5,Abc006,S-03,Ernst,01/10/2005""")
df = pd.read_csv(temp, sep=",")
df.groupby(["StudentCode","Name","Birth","ID"]).mean()
StudentCode Name Birth ID
S-01 John 03/10/2000 Abc001
Abc002
Abc003
S-02 Mark 12/08/2001 Abc004
Abc005
S-03 Ernst 01/10/2005 Abc006

Have you checked the groupby function?
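One way groupby could produce the single-row-per-student layout from the question is to group on the repeated columns and join the IDs with a delimiter. This is a sketch under the assumption that ID is a string column and that "; " is the wanted separator; the sample data is copied from the question.

```python
import io

import pandas as pd

csv_text = """ID,StudentCode,Name,Birth
Abc001,S-01,John,03/10/2000
Abc002,S-01,John,03/10/2000
Abc003,S-01,John,03/10/2000
Abc004,S-02,Mark,12/08/2001
Abc005,S-02,Mark,12/08/2001
Abc006,S-03,Ernst,01/10/2005
"""
df = pd.read_csv(io.StringIO(csv_text))

# Group on the columns that repeat, then join each group's IDs into one string.
merged = (
    df.groupby(["StudentCode", "Name", "Birth"], sort=False)["ID"]
      .agg("; ".join)
      .reset_index()
)
print(merged)
```

`sort=False` keeps the groups in their order of first appearance rather than sorting them by key.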

Update a dataframe iteratively

I have a dataframe:
QID URL Questions Answers Section QType Theme Topics Answer0 Answer1 Answer2 Answer3 Answer4 Answer5 Answer6
1113 1096 https://docs.google.com/forms/d/1hIkfKc2frAnxsQzGw_h4bIqasPwAPzzLFWqzPE3B_88/edit?usp=sharing To what extent are the following factors considerations in your choice of flight? ['Very important consideration', 'Important consideration', 'Neutral', 'Not an important consideration', 'Do not consider'] When choosing an airline to fly with, which factors are most important to you? (Please list 3.) Multiple Choice Airline XYZ ['extent', 'follow', 'factor', 'consider', 'choic', 'flight'] Very important consideration Important consideration Neutral Not an important consideration Do not consider NaN NaN
1116 1097 https://docs.google.com/forms/d/1hIkfKc2frAnxsQzGw_h4bIqasPwAPzzLFWqzPE3B_88/edit?usp=sharing How far in advance do you typically book your tickets? ['0-2 months in advance', '2-4 months in advance', '4-6 months in advance', '6-8 months in advance', '8-10 months in advance', '10-12 months in advance', '12+ months in advance'] When choosing an airline to fly with, which factors are most important to you? (Please list 3.) Multiple Choice Airline XYZ ['advanc', 'typic', 'book', 'ticket'] 0-2 months in advance 2-4 months in advance 4-6 months in advance 6-8 months in advance 8-10 months in advance 10-12 months in advance 12+ months in advance
I want to replace a few of its rows, which are actually question-grid titles, with new rows that also represent the answers. I have another file, a pickle, which contains the information needed to build the rows that will replace the old ones. Each old row will be turned into several new rows (I mention this because I do not know how to do it).
These lines are just the grid titles of questions like the following one:
Expected dataframe
I would like to insert them in the original dataframe, instead of the lines where they match in the 'Questions' column, as in the following dataframe:
QID Questions QType Answer1 Answer2 Answer3 Answer4 Answer5
1096 'To what extent are the following factors considerations in your choice of flight?' Question Grid 'Very important consideration' 'Important consideration' 'Neutral' 'Not an important consideration' 'Do not consider'
1096_S01 'The airline/company you fly with'
1096_S02 'The departure airport'
1096_S03 'Duration of flight/route'
1096_S04 'Baggage policy'
1097 'To what extent are the following factors considerations in your choice of flight?' Question Grid ...
1097_S01 ...
...
What I tried
import pickle
from functools import reduce
import pandas as pd
qa = pd.read_pickle(r'Python/interns.p')
df = pd.read_csv("QuestionBank.csv")
def isGrid(dic, df):
    '''Check if a row is a row related to a Google Forms grid;
    if that is the case, update this row'''
    d_answers = dic['answers']
    try:
        answers = d_answers[2]
        if len(answers) > 1:
            # find the line in df and replace the line where it matches by the lines
            update_lines(dic, df)
        return df
    except TypeError:
        return df
def update_lines(dic, df):
    '''find the line in df and replace the line where it matches
    with the question in dic by the new lines'''
    lines_to_replace = df.index[df['Questions'] == dic['question']].tolist() # might be several rows and maybe they aren't all to replace
    # I check there is at least a row to replace
    if lines_to_replace:
        # I replace all rows where the question matches
        for line_to_replace in lines_to_replace:
            # replace this row and the following by the following dataframe
            questions = reduce(lambda a,b: a + b,[data['answers'][2][x][3] for x in range(len(data['answers'][2]))])
            ind_answers = dic["answers"][2][0][1]
            answers = []
            # I get all the potential answers
            for i in range(len(ind_answers)):
                answers.append(reduce(lambda a,b: a+b,[ind_answers[i] for x in range(len(questions))])) # duplicates as there are many lines with the same answers in a grid, maybe I should have used set
            answers = {f"Answer{i}": answers[i] for i in range(0, len(answers))} # dynamically allocate to place them in the right columns
            dict_replacing = {'Questions': questions, **answers} # dictionary used to create the new lines
            df1 = pd.DataFrame(dict_replacing)
            df1.index = df1.index / 10 + line_to_replace
            df = df1.combine_first(df)
    return df
I did a Colaboratory notebook if needed.
What I obtain
But the dataframe is the same size before and after we do this. In effect, I get:
QID Questions QType Answer1 Answer2 Answer3 Answer4 Answer5
1096 'To what extent are the following factors considerations in your choice of flight?' Question Grid 'Very important consideration' 'Important consideration' 'Neutral' 'Not an important consideration' 'Do not consider'
1097 'To what extent are the following factors considerations in your choice of flight?' Question Grid ...
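One thing that may explain the unchanged size: `combine_first` returns a new dataframe, and `isGrid` calls `update_lines(dic, df)` without keeping the value it returns, so the caller still holds the original object. The sketch below shows the general idea of replacing one row with several and reassigning the result. The helper name `replace_row` and the two-column sample frame are hypothetical and much simpler than the real question bank.

```python
import pandas as pd


def replace_row(df, position, new_rows):
    """Return a NEW frame where the row at integer `position` is replaced by `new_rows`."""
    before = df.iloc[:position]
    after = df.iloc[position + 1:]
    return pd.concat([before, new_rows, after], ignore_index=True)


# Hypothetical, simplified stand-in for the question bank.
df = pd.DataFrame({
    "QID": ["1096", "1097"],
    "Questions": ["grid title A", "grid title B"],
})
new_rows = pd.DataFrame({
    "QID": ["1096", "1096_S01", "1096_S02"],
    "Questions": ["grid title A",
                  "The airline/company you fly with",
                  "The departure airport"],
})

# The result must be assigned back; the original frame is not modified in place.
df = replace_row(df, 0, new_rows)
print(df)
```

Applied to the code in the question, the equivalent change would be to write `df = update_lines(dic, df)` inside `isGrid` and to use the value `isGrid` returns.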

Convert a text file consisting of strings into a dictionary

I want to know how to convert a text file consisting of strings into a dictionary. My text file looks like this:
Donald Trump, 45th US President, 71 years old
Barack Obama, 44th US President, 56 years old
George W. Bush, 43rd US President, 71 years old
I want to be able to convert that text file into a dictionary being:
{Donald Trump: 45th US President, 71 years old, Barack Obama: 44th US President, 56 years old, George W. Bush: 43rd US President, 71 years old}
How would I go about doing this? Thanks!
I tried to do it by doing this:
d = {}
with open('presidents.txt', 'r') as f:
    for line in f:
        key = line[0]
        value = line[1:]
        d[key] = value

Is this what you're looking for?
d = {}
with open("presidents.txt", "r") as f:
    for line in f:
        # split each line into name, title and age
        k, v, z = line.strip().split(",")
        d[k.strip()] = v.strip(), z.strip()
# no f.close() needed: the with block closes the file
print(d)
The final output looks like this:
{'Donald Trump': ('45th US President', '71 years old'), 'Barack Obama': ('44th US President', '56 years old'), 'George W. Bush': ('43rd US President', '71 years old')}
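If the value should stay a single string, as in the dictionary shown in the question, one option is to split each line only at its first comma. The sketch below assumes the name never contains a comma; the function name `parse_presidents` is hypothetical, and an in-memory buffer stands in for presidents.txt.

```python
import io


def parse_presidents(lines):
    """Map each name to the rest of its line, split at the first comma only."""
    result = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore empty lines
        name, _, rest = line.partition(",")
        result[name.strip()] = rest.strip()
    return result


# Stand-in for open("presidents.txt"); contents copied from the question.
sample = io.StringIO(
    "Donald Trump, 45th US President, 71 years old\n"
    "Barack Obama, 44th US President, 56 years old\n"
    "George W. Bush, 43rd US President, 71 years old\n"
)
d = parse_presidents(sample)
print(d)
```

Because only the first comma is used, any further commas in the description stay in the value instead of breaking the unpacking.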

You can use pandas for this:
import pandas as pd
# skipinitialspace drops the blank after each comma
df = pd.read_csv('file.csv', sep=',', skipinitialspace=True, header=None, names=['Name', 'President', 'Age'])
d = df.set_index(['Name'])[['President', 'Age']].T.to_dict(orient='list')
# {'Donald Trump': ['45th US President', '71 years old'],
#  'Barack Obama': ['44th US President', '56 years old'],
#  'George W. Bush': ['43rd US President', '71 years old']}

I wrote a code for 2 or 3 inputs but for many inputs what should i do [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 4 years ago.
print("Store Room Stock Category")
print("")
print("")
Stockmachinary1 = input("Enter the stock material name:")
Stockmachinary1price=int(input("Enter the stock material price:"))
Stockmachinary2=input("Enter the stock material name:")
Stockmachinary2price=int(input("Enter the stock material price:"))
Stockmachinary3=input("Enter the stock material name:")
Stockmachinary3price=int(input("Enter the stock material price:"))
Totalstockprice=Stockmachinary1price+Stockmachinary2price+Stockmachinary3price
import pandas as pd
stock = pd.DataFrame({"stock":[Stockmachinary1,Stockmachinary2,Stockmachinary3,"totalcoststock"],\
"price":[Stockmachinary1price,Stockmachinary2price,Stockmachinary3price,Totalstockprice]})
stock=stock[["stock","price"]]
stock
Totalstockprice

If the goal is to avoid writing repetitive code, you should use a loop, such as the for loop below:
print("Store Room Stock Category")
print("")
print("")
StockmachinaryNames = []
StockmachinaryPrice = []
counts = int(input("Enter the number of stock materials you want to input:"))
for i in range(counts):
    Name = input("Enter the stock material name:")
    Price = int(input("Enter the stock material price:"))
    StockmachinaryNames.append(Name)
    StockmachinaryPrice.append(Price)
TotalstockPrice = sum(StockmachinaryPrice)
StockmachinaryNames.append("totalcoststock")
StockmachinaryPrice.append(TotalstockPrice)
import pandas as pd
stock = pd.DataFrame({"stock": StockmachinaryNames,
                      "price": StockmachinaryPrice})
stock=stock[["stock","price"]]
print(stock)
print(TotalstockPrice)
But if you are talking about batch data input, you may need a CSV or another file format for the input, and pandas works well with that. Here is the help page:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
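As a rough illustration of the batch-input idea, the sketch below reads names and prices with `pd.read_csv` and appends the total row. The column names and the three sample items are made up for the example, and an in-memory buffer stands in for a real CSV file.

```python
import io

import pandas as pd

# Hypothetical file contents; with a real file, pass its path to read_csv.
csv_text = """stock,price
hammer,10
drill,45
saw,20
"""
stock = pd.read_csv(io.StringIO(csv_text))

# Sum the price column and append it as a final "totalcoststock" row.
total = int(stock["price"].sum())
total_row = pd.DataFrame({"stock": ["totalcoststock"], "price": [total]})
stock = pd.concat([stock, total_row], ignore_index=True)

print(stock)
print(total)
```

With this layout, adding more materials only means adding lines to the file; the code stays the same.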
