I've got some data that looks like this:
tweet_id worker_id option
397921751801147392 A1DZLZE63NE1ZI pro-vaccine
397921751801147392 A3UJO2A7THUZTV pro-vaccine
397921751801147392 A3G00Q5JV2BE5G pro-vaccine
558401694862942208 A1G94QON7A9K0N other
558401694862942208 ANMWPCK7TJMZ8 other
What I would like is a single line for each tweet id, with six columns identifying the three worker ids and their options.
The desired output is something like:
tweet_id worker_id_1 option_1 worker_id_2 option_2 worker_id_3 option_3
397921751801147392 A1DZLZE63NE1ZI pro-vaccine A3UJO2A7THUZTV pro-vaccine A3G00Q5JV2BE5G pro-vaccine
How can I achieve this with pandas?
This is about reshaping data from long to wide format. You can create a grouped cumulative count column to serve as the id that gets spread into new column headers, then use pivot_table(), and finally rename the columns by joining the two levels of the resulting MultiIndex.
df['count'] = df.groupby('tweet_id').cumcount() + 1
# aggfunc='first' keeps the single value in each group as-is
df1 = df.pivot_table(values=['worker_id', 'option'], index='tweet_id',
                     columns='count', aggfunc='first')
df1.columns = [x + "_" + str(y) for x, y in df1.columns]
An alternative option to pivot_table() is unstack():
df['count'] = df.groupby('tweet_id').cumcount() + 1
df1 = df.set_index(['tweet_id', 'count']).unstack(level=1)
df1.columns = [x + "_" + str(y) for x, y in df1.columns]
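Putting it together on the sample data, here is a minimal runnable sketch (using aggfunc='first', since each tweet/count cell holds exactly one value; the wide columns come out sorted, options before worker ids):

```python
import pandas as pd

df = pd.DataFrame({
    'tweet_id': [397921751801147392] * 3 + [558401694862942208] * 2,
    'worker_id': ['A1DZLZE63NE1ZI', 'A3UJO2A7THUZTV', 'A3G00Q5JV2BE5G',
                  'A1G94QON7A9K0N', 'ANMWPCK7TJMZ8'],
    'option': ['pro-vaccine'] * 3 + ['other'] * 2,
})

# number the annotations within each tweet
df['count'] = df.groupby('tweet_id').cumcount() + 1

# spread to wide format; 'first' keeps each single value untouched
wide = df.pivot_table(values=['worker_id', 'option'], index='tweet_id',
                      columns='count', aggfunc='first')

# flatten the (name, count) MultiIndex into name_count headers
wide.columns = [f"{name}_{num}" for name, num in wide.columns]
```

Tweets with fewer than three annotations simply get NaN in the missing slots.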
Related
I want to add a new column called "RING" to each dataframe in a list of dataframes; it should contain the word "RING" concatenated with another column called "No".
Here is my solution so far:
df_all = [df1,df2,df3]
for df in df_all:
    df["RING "] = "RING" + str(df['No'])
df_all
Is there a way that doesn't require a for loop?
You are almost there:
df_all = [df1,df2,df3]
for df in df_all:
    df["RING"] = "RING" + df["No"]
    # If df["No"] is not of type string, cast it to string:
    # df["RING"] = "RING" + df["No"].astype("str")
df_all
You can concat all dataframes in the list to get one df (then work with it):
df_all = [df1,df2,df3]
df = pd.concat(df_all, axis=0, ignore_index=True)
df["RING"] = "RING" + df['No'].astype(str)
If you want to come back and get separate dataframes later, you can do this:
df_all = [df1,df2,df3]
df1['df_id'] = 1
df2['df_id'] = 2
df3['df_id'] = 3
df = pd.concat(df_all, axis=0, ignore_index=True)
df["RING"] = "RING" + df['No'].astype(str)
# split back out by df_id
df1 = df.loc[df['df_id'].eq(1)]
df2 = df.loc[df['df_id'].eq(2)]
df3 = df.loc[df['df_id'].eq(3)]
If you don't want to use concat, you can try a list comprehension, which is usually faster than a for loop:
df_all = [df1,df2,df3]
def process_df(df):
    df["RING"] = "RING" + df['No'].astype(str)
    return df

processed_df_all = [process_df(df) for df in df_all]
#df1 = processed_df_all[0]
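Note that process_df mutates the original dataframes in place. If you want to keep them untouched, a small sketch using DataFrame.assign (which returns a new dataframe) does the same thing without mutation:

```python
import pandas as pd

df1 = pd.DataFrame({'No': [1, 2]})
df2 = pd.DataFrame({'No': [3, 4]})
df_all = [df1, df2]

# assign returns a copy, so df1 and df2 keep their original columns
processed = [df.assign(RING='RING' + df['No'].astype(str)) for df in df_all]
```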
My question is somewhat similar to subtracting-two-columns-named-in-certain-pattern
I'm having trouble performing operations on columns that share the same root substring, without a loop. Basically I want to calculate a percentage change using columns that end with '_PY' with another column that shares the same name except for the suffix.
What's a possible one-line solution, or one that doesn't involve a for loop?
url = r'https://www2.arccorp.com/globalassets/forms/corpstats.csv?1653338666304'
df = pd.read_csv(url)
df = df[df['TYPE'] == 'M']
PY_cols = [col for col in df.columns if col.endswith("PY")]
reg_cols = [col.split("_PY")[0] for col in PY_cols]
for k,v in zip(reg_cols,PY_cols):
    df[f"{k}_YOY%"] = round((df[k] - df[v]) / df[v] * 100,2)
df
You can use:
v = (df[df.columns[df.columns.str.endswith('_PY')]]
.rename(columns=lambda x: x.rsplit('_', maxsplit=1)[0]))
k = df[v.columns]
out = pd.concat([df, k.sub(v).div(v).mul(100).round(2).add_suffix('_YOY%')], axis=1)
You need to subset the df into the columns with the suffix; then zip will pull the pairs you need to do the percent calculation.
url = r'https://www2.arccorp.com/globalassets/forms/corpstats.csv?1653338666304'
df = pd.read_csv(url)
df = df[df['TYPE'] == 'M']
df_cols = [col for col in df.columns]
PY_cols = [col for col in df.columns if col.endswith("PY")]
# find the matching column, where the names match without the suffix.
PY_use = [col for col in PY_cols if col.split("_PY")[0] in df_cols]
df_use = [col.split("_PY")[0] for col in PY_use]
for k,v in zip(df_use,PY_use):
    df[f"{k}_YOY%"] = round((df[k] - df[v]) / df[v] * 100,2)
You can take advantage of numpy:
import numpy as np

py_df_array = (df[df_use].values, df[PY_use].values)
perc_dif = np.round((py_df_array[0] - py_df_array[1]) / py_df_array[1] * 100, 2)
df_perc = pd.DataFrame(perc_dif, columns=[f"{col}_YOY%" for col in df_use])
# reset the filtered index so the new columns align row by row
df = pd.concat([df.reset_index(drop=True), df_perc], axis=1)
I have a sheet that looks like this.
Fleet Risk Control  Communication  Interpersonal relationships  Demographic  Demographic
Q_21086             Q_21087        Q_21088                      AGE          GENDER
1                   3              4                            27           Male
What I'm trying to achieve is where there is a row with 'Q_' inside of it, merge that cell with the top row and return a new dataframe.
So the existing data above would become something like this:
Fleet Risk Control - Q_21086  Communication - Q_21087  Interpersonal relationships - Q_21088
1                             3                        4
I honestly have no idea where to even begin with something like this.
You could try this. First, recreate the input:
import pandas as pd
df = pd.DataFrame({'Fleet Risk Control': ['Q_21086', 1],
'Communication': ['Q_21087', 3],
'Interpersonal relationships': ['Q_21088', 4],
'Demographic': ['AGE', 27],
'Demographic 2': ['Gender', 'Male']})
Now concatenate the column headers with the first row of df:
df.columns = df.columns + ' - ' + df.iloc[0, :]
Then keep every row except the first and drop the last two columns:
df = df.iloc[1:, :-2]
Alternatively, starting from the original df, rename only the columns whose first row starts with 'Q_':
df.columns = [x + ' - ' + y if y.startswith('Q_') else x for x, y in zip(df.columns, df.iloc[0])]
# drop the columns whose first row does not start with 'Q_', then drop that row
to_drop = [c for c, flag in df.iloc[0].apply(lambda x: not x.startswith('Q_')).items() if flag]
df = df.drop(to_drop, axis=1)[1:]
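Putting the first variant together as a runnable sketch (using an explicit zip for the header merge, which sidesteps any Index/Series alignment surprises):

```python
import pandas as pd

df = pd.DataFrame({'Fleet Risk Control': ['Q_21086', 1],
                   'Communication': ['Q_21087', 3],
                   'Interpersonal relationships': ['Q_21088', 4],
                   'Demographic': ['AGE', 27],
                   'Demographic 2': ['Gender', 'Male']})

# merge the first row into the headers positionally
df.columns = [f"{col} - {val}" for col, val in zip(df.columns, df.iloc[0])]

# drop the merged row and the two trailing demographic columns
df = df.iloc[1:, :-2]
```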
I'm looking to append a multi-index column headers to an existing dataframe, this is my current dataframe.
Name = pd.Series(['John','Paul','Sarah'])
Grades = pd.Series(['A','A','B'])
HumanGender = pd.Series(['M','M','F'])
DogName = pd.Series(['Rocko','Oreo','Cosmo'])
Breed = pd.Series(['Bulldog','Poodle','Golden Retriever'])
Age = pd.Series([2,5,4])
DogGender = pd.Series(['F','F','F'])
SchoolName = pd.Series(['NYU','UCLA','UCSD'])
Location = pd.Series(['New York','Los Angeles','San Diego'])
df = (pd.DataFrame({'Name':Name,'Grades':Grades,'HumanGender':HumanGender,'DogName':DogName,'Breed':Breed,
'Age':Age,'DogGender':DogGender,'SchoolName':SchoolName,'Location':Location}))
I want add 3 columns on top of the existing columns I already have. For example, columns [0,1,2,3] should be labeled 'People', columns [4,5,6] should be labeled 'Dogs', and columns [7,8] should be labeled 'Schools'. In the final result, it should be 3 columns on top of 9 columns.
Thanks!
IIUC, you can do:
newlevel = ['People']*4 + ['Dogs']*3 + ['Schools']*2
df.columns = pd.MultiIndex.from_tuples([*zip(newlevel, df.columns)])
Note [*zip(newlevel, df.columns)] is equivalent to
[(a,b) for a,b in zip(newlevel, df.columns)]
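For a quick check, here is the same idea as a runnable sketch on a one-row version of the data, using pd.MultiIndex.from_arrays (an equivalent constructor that skips the zip):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John'], 'Grades': ['A'], 'HumanGender': ['M'],
                   'DogName': ['Rocko'], 'Breed': ['Bulldog'], 'Age': [2],
                   'DogGender': ['F'], 'SchoolName': ['NYU'],
                   'Location': ['New York']})

newlevel = ['People'] * 4 + ['Dogs'] * 3 + ['Schools'] * 2
# from_arrays builds the two-level header directly from parallel sequences
df.columns = pd.MultiIndex.from_arrays([newlevel, df.columns])
```

Selecting a top-level label, e.g. df['People'], now returns just that group of columns.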
I have a data frame with columns like
Name Date Date_x Date_y A A_x A_y..
and I need to add _z to the columns (except the Name column) that don't already have _x or _y . So, I want the output to be similar to
Name Date_z Date_x Date_y A_z A_x A_y...
I've tried
df.iloc[:,~df.columns.str.contains('x|y|Name')]=df.iloc[:,~df.columns.str.contains('x|y|Name')].add_suffix("_z")
# doesn't add suffixes and replaces columns with all nans
df.columns=df.columns.map(lambda x : x+'_z' if "x" not in x or "y" not in x else x)
#many variations of this but seems to add _z to all of the column names
How about:
df.columns = [x if x=='Name' or '_' in x else x+'_z' for x in df.columns]
You can also try:
df.rename(columns = lambda x: x if x=='Name' or '_' in x else x+'_z')
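A quick sanity check of the rename version on the column layout from the question (both snippets assume '_' only appears in the already-suffixed columns):

```python
import pandas as pd

# empty frame carrying only the column names from the question
df = pd.DataFrame(columns=['Name', 'Date', 'Date_x', 'Date_y', 'A', 'A_x', 'A_y'])

# leave 'Name' and already-suffixed columns alone; append '_z' to the rest
out = df.rename(columns=lambda x: x if x == 'Name' or '_' in x else x + '_z')
```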
stealing slightly from Quang Hoang ;)
Add '_z' where the column stub is duplicated and without a suffix.
m = (df.columns.str.split('_').str[0].duplicated(keep=False)
& ~df.columns.str.contains('_'))
df.columns = df.columns.where(~m, df.columns+'_z')
I would use index.putmask as follows:
m = (df.columns == 'Name') | df.columns.str[-2:].isin(['_x','_y'])
df.columns = df.columns.putmask(~m, df.columns+'_z')
In [739]: df.columns
Out[739]: Index(['Name', 'Date_z', 'Date_x', 'Date_y', 'A_z', 'A_x', 'A_y'], dtype='object')