Merge rows in pandas based on a common value - python

I have a CSV file where the data of one ID is in many rows. I want to merge all the data of one ID in one row by increasing the number of columns.
Id X Y
ABC 56 23
ABC 77 74
XYX 11 51
to
Id X Y X Y
ABC 56 23 77 74
XYX 11 51
How to do it?
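One possible approach, sketched here under assumptions rather than given as a definitive answer: number the rows within each Id with `groupby(...).cumcount()`, then move that number into the columns with `unstack`. The helper column `n` and the suffixed names `X_1`, `Y_1`, `X_2`, `Y_2` are illustrative choices, since a DataFrame with repeated `X Y` column names is awkward to work with.

```python
import pandas as pd

df = pd.DataFrame({
    "Id": ["ABC", "ABC", "XYX"],
    "X": [56, 77, 11],
    "Y": [23, 74, 51],
})

# Number the rows within each Id: 0 for the first row, 1 for the second, ...
df["n"] = df.groupby("Id").cumcount()

# Move the per-Id row number into the columns: (X, 0), (X, 1), (Y, 0), (Y, 1)
wide = df.set_index(["Id", "n"]).unstack("n")

# Put the row number first and sort, so each X/Y pair stays together
wide = wide.swaplevel(axis=1).sort_index(axis=1)

# Flatten the two-level column names into X_1, Y_1, X_2, Y_2
wide.columns = [f"{name}_{n + 1}" for n, name in wide.columns]
wide = wide.reset_index()
print(wide)
```

Ids with fewer rows than the longest group get NaN in the extra columns, which also turns those columns into floats.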

Related

Python combine integer columns to create multiple columns with suffix

I have a dataframe with a sample of the employee survey results as shown below. The values in the delta columns are just the difference between the FY21 and FY20 columns.
Employee leadership_fy21 leadership_fy20 leadership_delta comms_fy21 comms_fy20 comms_delta
patrick.t#abc.com 88 50 38 90 80 10
johnson.g#abc.com 22 82 -60 80 90 -10
pamela.u#abc.com 41 94 -53 44 60 -16
yasmine.a#abc.com 90 66 24 30 10 20
I'd like to create multiple columns that
i. add a % sign to the fy21 values
ii. merge them with the matching delta column so that the delta value is in parentheses.
Example output would be:
Employee leadership_fy21 leadership_delta leadership_final comms_fy21 comms_delta comms_final
patrick.t#abc.com 88 38 88% (38) 90 10 90% (10)
johnson.g#abc.com 22 -60 22% (-60) 80 -10 80% (-10)
pamela.u#abc.com 41 -53 41% (-53) 44 -16 44% (-16)
yasmine.a#abc.com 90 24 90% (24) 30 20 30% (20)
I have tried the following code but it doesn't seem to work. It might have to do with numpy not being able to combine strings. Appreciate any form of help I can get, thank you.
#create a list of all the rating columns
ratingcollist = ['leadership','comms','wellbeing','teamwork']
#create a for loop to get all the columns that match the column list
for rat in ratingcollist:
    cols = df.filter(like=rat).columns
    fy21cols = df[cols].filter(like='_fy21').columns
    deltacols = df[cols].filter(like='_delta').columns
    if len(cols) > 0:
        df[f'{rat.lower()}final'] = (df[fy21cols].values.astype(str) + '%' + '(' + df[deltacols].values.astype(str) + ')')
You can do this:
def yourfunction(ratingcol):
    x=df.filter(regex=f'{ratingcol}(_delta|_fy21)')
    fy=x.filter(regex='21').iloc[:,0].astype(str)
    delta=x.filter(regex='_delta').iloc[:,0].astype(str)
    return(fy+"%("+delta+")")
yourfunction('leadership')
0 88%(38)
1 22%(-60)
2 41%(-53)
3 90%(24)
Then, using a for loop you can create your columns
for i in ratingcollist:
    df[f"{i}_final"]=yourfunction(i)
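For comparison, here is a self-contained sketch of the same idea that works on pandas Series directly instead of `.values`, so the string concatenation never goes through NumPy arrays. It uses two rows of the sample data and assumes the column names follow the `<rating>_fy21` / `<rating>_delta` pattern from the question. It also skips ratings such as `wellbeing` that are not in the sample, and puts a space before the parenthesis to match the requested output.

```python
import pandas as pd

df = pd.DataFrame({
    "Employee": ["patrick.t#abc.com", "johnson.g#abc.com"],
    "leadership_fy21": [88, 22],
    "leadership_delta": [38, -60],
    "comms_fy21": [90, 80],
    "comms_delta": [10, -10],
})

rating_cols = ["leadership", "comms", "wellbeing", "teamwork"]

for rat in rating_cols:
    fy21_col, delta_col = f"{rat}_fy21", f"{rat}_delta"
    # Skip ratings whose columns are not present in this frame
    if fy21_col not in df.columns or delta_col not in df.columns:
        continue
    # Series + str concatenates element-wise once both sides are strings
    df[f"{rat}_final"] = (
        df[fy21_col].astype(str) + "% (" + df[delta_col].astype(str) + ")"
    )

print(df[["Employee", "leadership_final", "comms_final"]])
```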

Writing a dict of large dataframes to excel

I am creating dicts where the dict keys are strings and the values are large-ish pandas DataFrames. I would like to write these dicts to an Excel file, but the issue I'm having is that when Python writes a dataframe to a CSV it cuts out parts. Code:
import pandas as pd
import numpy as np
def create_random_df():
    return(pd.DataFrame(np.random.randint(0,100,size=(70,26)),columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')))
dic={'key1': create_random_df() , 'key2': create_random_df()}
with open('test.csv', 'w') as f:
    for key in dic.keys():
        f.write("%s,%s\n"%(key,dic[key]))
This sort of outputs the format I'd like except for the following:
All of the dataframe columns are in Cell B1 and they're not complete... it's
A B C D E F G H I ... R S T U V W X Y
Z
and then the indexes and dataframe elements are all in columns A. i.e. Cells A2:A4 is
0 55 96 60 47 11 3 2 69 50 ... 3 23 26 3 15 53
78 95 49
1 72 48 12 25 32 57 11 84 5 ... 11 43 56 0 68 55
95 64 84
2 80 56 78 58 79 72 67 97 58 ... 84 34 18 21 71 20
72 36 37
I'd like the dataframes to be written to the CSV in their entirety, with the values in discrete cells.
You can try:
dic={'key1': create_random_df() , 'key2': create_random_df()}
with open('test.csv', 'w') as f:
    for key in dic.keys():
        df = dic[key]
        df.insert(0,'Key', pd.Series([key]))
        df.Key = df.Key.fillna('')
        f.write(df.to_csv(index=False))
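The truncation in the question comes from formatting the DataFrame with `%s`, which writes its abbreviated printed form (the one with `...`) rather than CSV data. A minimal sketch of an alternative layout follows: each key goes on its own line, followed by the complete frame written by `to_csv`. The helper name `write_dict_to_csv` is hypothetical, and an in-memory buffer stands in for the file so the example runs without touching disk.

```python
import io

import numpy as np
import pandas as pd


def create_random_df():
    return pd.DataFrame(np.random.randint(0, 100, size=(70, 26)),
                        columns=list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"))


dic = {"key1": create_random_df(), "key2": create_random_df()}


def write_dict_to_csv(frames, handle):
    """Write each key on its own line, followed by its full DataFrame."""
    for key, frame in frames.items():
        handle.write(f"{key}\n")
        frame.to_csv(handle)   # complete frame, one value per cell
        handle.write("\n")     # blank line between frames


buffer = io.StringIO()
write_dict_to_csv(dic, buffer)
text = buffer.getvalue()
print(text[:200])
```

To write a real file, the same helper can be called as `with open('test.csv', 'w', newline='') as f: write_dict_to_csv(dic, f)`. If an actual .xlsx workbook is wanted, `pd.ExcelWriter` with one `to_excel(..., sheet_name=key)` call per frame is another option, provided an Excel engine such as openpyxl is installed.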

I need help building new dataframe from old one, by applying method to each row, keeping same index and columns

I have a dataframe (df_input), and I'm trying to convert it to another dataframe (df_output) by applying a formula to each element in each row. The formula requires information about the whole row (min, max, median).
df_input:
A B C D E F G H I J
2011-01-01 60 48 26 29 41 91 93 87 39 65
2011-01-02 88 52 24 99 1 27 12 26 64 87
2011-01-03 13 1 38 60 8 50 59 1 3 76
df_output:
F(A) F(B) F(C) F(D) F(E) F(F) F(G) F(H) F(I) F(J)
2011-01-01 93 54 45 52 8 94 65 37 2 53
2011-01-02 60 44 94 62 78 77 37 97 98 76
2011-01-03 53 58 16 63 60 9 31 44 79 35
I'm trying to go from df_input to df_output, as above, by applying f(x) to each cell per row. The function foo maps element x to f(x) by doing an OLS regression of the min, median and max of the row to some coordinates. This is done each period.
I'm aware that I can iterate over the rows and then, for each row, apply the function to each element. Where I am struggling is getting the output of foo into df_output.
for index, row in df_input.iterrows():
    min=row.min()
    max=row.max()
    mean=row.mean()
    #apply function to row
    new_row = row.apply(lambda x: foo(x,min,max,mean))
    #add this to df_output
help!
My current thinking is to build up the new df row by row. I'm trying to do that, but I'm getting a lot of multiindex columns etc. Any pointers would be great.
Thanks so much... merry xmas to you all.
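Consider calculating the row aggregates with DataFrame methods and then passing those Series into DataFrame.apply(), which runs column by column. This assumes foo works on whole Series (vectorized arithmetic) rather than on single values. Compute every aggregate from the original data columns only, so that a helper column added earlier does not leak into a later aggregate such as the mean:
# ROW-WISE AGGREGATES (FROM THE DATA COLUMNS ONLY)
cols = list('ABCDEFGHIJ')
df['row_min'] = df[cols].min(axis=1)
df['row_max'] = df[cols].max(axis=1)
df['row_mean'] = df[cols].mean(axis=1)
# COLUMN-WISE CALCULATION (DEFAULT axis=0)
new_df = df[cols].apply(lambda col: foo(col,
                                        df['row_min'],
                                        df['row_max'],
                                        df['row_mean']))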
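A runnable sketch of this pattern on the sample data is below. The real `foo` (the OLS-based mapping) is not shown in the question, so the `foo` here is a hypothetical stand-in that rescales each value to 0..100 between the row minimum and maximum. It is only meant to show that the result keeps the same index and columns as the input.

```python
import pandas as pd

df_input = pd.DataFrame(
    [[60, 48, 26, 29, 41, 91, 93, 87, 39, 65],
     [88, 52, 24, 99, 1, 27, 12, 26, 64, 87],
     [13, 1, 38, 60, 8, 50, 59, 1, 3, 76]],
    index=pd.to_datetime(["2011-01-01", "2011-01-02", "2011-01-03"]),
    columns=list("ABCDEFGHIJ"),
)


def foo(x, row_min, row_max, row_mean):
    # Hypothetical stand-in for the real formula: min-max scaling to 0..100.
    # row_mean is accepted only to mirror the signature in the question.
    return 100 * (x - row_min) / (row_max - row_min)


# Row aggregates are Series indexed by date, aligned with every column
row_min = df_input.min(axis=1)
row_max = df_input.max(axis=1)
row_mean = df_input.mean(axis=1)

# apply() passes one column at a time; Series arithmetic aligns on the index
df_output = df_input.apply(lambda col: foo(col, row_min, row_max, row_mean))
print(df_output.round(1))
```

Keeping the aggregates in separate variables, instead of adding them as columns, leaves `df_input` unchanged and avoids having to select the data columns again.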

Iterating through the rows in mysql in python

I have a mysql database table consisting of 8 columns as given
ID C1 C2 C3 C4 C5 C6 C7
1 25 33 76 87 56 76 47
2 67 94 90 56 77 32 84
3 53 66 24 93 33 88 99
4 73 34 52 85 67 82 77
5 78 55 52 100 78 68 32
6 67 35 60 93 88 53 66
I need to fetch 3 rows at a time, with all the columns except the ID column. So far I have written this Python code, which fetches the rows with ID values 1,2,3.
ex = MySQLdb.connect(host=host, port=port, user=user, passwd=passwd, db=db)
with ex:
    ex_cur = ex.cursor()
    ex_cur.execute("SELECT C1,C2,C3,C4,C5,C6,C7 FROM table LIMIT 0, 3;")
In the second cycle I need to fetch rows with ID values 2,3,4, the third cycle fetches rows with ID values 3,4,5, and this should continue till the end of the table. What query should I use to iterate through the table so as to get the desired set of rows?
I believe there are three ways of doing this (I'm going to explain at a very high level):
1. You can create a queue with a size limit of 3 and read in the rows as a stream. Once the queue reaches the max size of 3, do your processing, pop off the first element in your queue, and proceed with the stream. (More efficient)
2. You would need an iterator and reset your cursor for every set of 3 IDs that you have to do.
3. Since your table is relatively small (would not suggest this for larger tables), load the whole database into a data structure/into memory. Perhaps make an object for the rows and use an ORM to map rows to objects. Then you would simply have to iterate through each object, or set of 3 objects, and do the necessary processing.
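The queue-based option can be sketched with `collections.deque`, whose `maxlen` argument drops the oldest row automatically when a new one is appended. This is a minimal illustration, not a complete solution: the helper name `sliding_windows` is hypothetical, and a plain list of tuples stands in for the rows. With a real connection you would pass in the cursor after executing a single `SELECT ... ORDER BY ID`, assuming the driver's cursor can be iterated.

```python
from collections import deque


def sliding_windows(rows, size=3):
    """Yield every run of `size` consecutive rows from an iterable."""
    window = deque(maxlen=size)
    for row in rows:
        window.append(row)      # the oldest row is dropped once the deque is full
        if len(window) == size:
            yield list(window)


# Stand-in for the rows a cursor would return (columns C1..C7, ordered by ID)
rows = [
    (25, 33, 76, 87, 56, 76, 47),
    (67, 94, 90, 56, 77, 32, 84),
    (53, 66, 24, 93, 33, 88, 99),
    (73, 34, 52, 85, 67, 82, 77),
    (78, 55, 52, 100, 78, 68, 32),
    (67, 35, 60, 93, 88, 53, 66),
]

windows = list(sliding_windows(rows))
for window in windows:
    print(window)
```

Each row is read from the database once, so only one query is needed instead of one `LIMIT` query per window.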

How to write values to a csv file from another csv file

In the index.csv file, the fourth column has ten numbers ranging from 1-5. Each number can be regarded as an index, and each index corresponds to an array of numbers in filename.csv.
The row number of filename.csv represents the index, and each row has three numbers. My question is about using a nested loop to transfer the numbers in filename.csv to index.csv.
from numpy import genfromtxt
import numpy as np
import csv
data1 = genfromtxt('filename.csv', delimiter=',')
data2 = genfromtxt('index.csv', delimiter=',')
f = open('index.csv','wb')
write = csv.writer(f, delimiter=',',quoting=csv.QUOTE_ALL)
for row in data2:
for ch_row in data1:
if ( data2[row,3] == ch_row ):
write.writerow(data1[data2[row,3],:])
For example, the fourth column of index.csv contains 1,2,5,3,4,1,4,5,2,3 and filename.csv contains:
# filename.csv
20 30 50
70 60 45
35 26 77
93 37 68
13 08 55
What I need is to write the indexed row from filename.csv to index.csv and store these number in 5th, 6th and 7th column:
# index.csv
# 4 5 6 7
... 1 20 30 50
... 2 70 60 45
... 5 13 08 55
... 3 35 26 77
... 4 93 37 68
... 1 20 30 50
... 4 93 37 68
... 5 13 08 55
... 2 70 60 45
... 3 35 26 77
Can anyone help me solve this problem?
You need to indent your last 2 lines. Also, it looks like you are writing to the file from which you are reading.
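A nested loop is not strictly needed here, because the value in the fourth column can be used directly as a 1-based row number into filename.csv. The sketch below shows that lookup with the standard `csv` module and writes the result to a separate output, which avoids overwriting the file being read. The first three columns of index.csv are not shown in the question, so `a,b,c` are placeholder values, and in-memory strings stand in for the real files.

```python
import csv
import io

# Stand-ins for the real files; "a,b,c" are placeholder values for the
# first three columns of index.csv, which the question does not show
filename_csv = "20,30,50\n70,60,45\n35,26,77\n93,37,68\n13,08,55\n"
index_csv = "a,b,c,1\na,b,c,2\na,b,c,5\n"

# Load the lookup rows once; list position 0 holds the row for index 1
lookup = list(csv.reader(io.StringIO(filename_csv)))

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for row in csv.reader(io.StringIO(index_csv)):
    idx = int(row[3])                            # 1-based index in column 4
    writer.writerow(row[:4] + lookup[idx - 1])   # append columns 5, 6 and 7

result = out.getvalue()
print(result)
```

Reading the values as strings also keeps entries such as `08` exactly as they appear in the file.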
