Pandas Data frame group by one column whilst multiplying others - python

I am using python with pandas imported to manipulate some data from a csv file I have. Just playing around to try and learn something new.
I have the following data frame:
I would like to group the data by Col1 so that I get the following result, which is a groupby on Col1 with Col3 and Col4 multiplied together.
I have been watching some YouTube videos and reading similar questions on Stack Overflow, but I am having trouble. So far I have the following, which involves creating a new column to hold the result of Col3 x Col4:
df['Col5'] = df.Col3 * df.Col4
gf = df.groupby(['col1', 'Col5'])

You can use a solution without creating a new column: multiply the columns and aggregate by df['Col1'] with sum (writing df.Col3 * df.Col4 is just syntactic sugar for df['Col3'] * df['Col4']):
gf = (df.Col3 * df.Col4).groupby(df['Col1']).sum().reset_index(name='Col2')
print (gf)
Col1 Col2
0 12345 38.64
1 23456 2635.10
2 45678 419.88
Another possible solution: create an index from Col1 with set_index, multiply the columns with prod, and then sum by the index with level=0:
gf = df.set_index('Col1')[['Col3','Col4']].prod(axis=1).sum(level=0).reset_index(name='Col2')
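Note that Series.sum(level=0) is deprecated in newer pandas releases; a roughly equivalent sketch for recent versions replaces it with an explicit groupby on the index level:
gf = (df.set_index('Col1')[['Col3', 'Col4']]
        .prod(axis=1)
        .groupby(level=0).sum()       # replaces the deprecated .sum(level=0)
        .reset_index(name='Col2'))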

Almost, but you are grouping by too many columns in the end. Try:
gf = df.groupby('Col1')['Col5'].sum()
Or to get it as a dataframe, rather than Col1 as an index (I'm judging that this is what you want from your image), include as_index=False in your groupby:
gf = df.groupby('Col1', as_index=False)['Col5'].sum()
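For completeness, a minimal end-to-end sketch with made-up numbers (the values below are hypothetical, not the data from the question):
import pandas as pd

df = pd.DataFrame({'Col1': [12345, 12345, 23456, 45678],
                   'Col3': [2, 4, 5, 3],
                   'Col4': [1.2, 3.5, 10.0, 7.0]})

df['Col5'] = df.Col3 * df.Col4                          # product of Col3 and Col4 per row
gf = df.groupby('Col1', as_index=False)['Col5'].sum()   # one row per Col1
print(gf)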

Related

Adding a column with one single categorical value to a pandas dataframe

I have a pandas.DataFrame df and would like to add a new column col with one single value "hello". I would like this column to be of dtype category with the single category "hello". I can do the following.
df["col"] = "hello"
df["col"] = df["col"].astype("catgegory")
Do I really need to write df["col"] three times in order to achieve this?
After the first line I am worried that the intermediate dataframe df might take up a lot of space before the new column is converted to categorical. (The dataframe is rather large with millions of rows and the value "hello" is actually a much longer string.)
Are there any other straightforward, "short and snappy" ways of achieving this while avoiding the above issues?
An alternative solution is
df["col"] = pd.Categorical(itertools.repeat("hello", len(df)))
but it requires itertools and the use of len(df), and I am not sure how memory usage is under the hood.
We can explicitly build the Series of the correct size and type instead of implicitly doing so via __setitem__ then converting:
df['col'] = pd.Series('hello', index=df.index, dtype='category')
Sample Program:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3]})
df['col'] = pd.Series('hello', index=df.index, dtype='category')
print(df)
print(df.dtypes)
print(df['col'].cat.categories)
a col
0 1 hello
1 2 hello
2 3 hello
a int64
col category
dtype: object
Index(['hello'], dtype='object')
A simple way to do this would be to use df.assign to create your new variable, then change the dtype to category using df.astype along with a dictionary of dtypes for the specific columns.
df = df.assign(col="hello").astype({'col':'category'})
df.dtypes
A int64
col category
dtype: object
That way you don't have to create a Series the same length as the dataframe; you can just broadcast the input string directly, which is a bit more time- and memory-efficient.
This approach also scales well, as you can see below: you can assign multiple variables as needed, some based on more complex functions, and then set the datatypes for them as required.
df = pd.DataFrame({'A':[1,2,3,4]})
df = (df.assign(col1='hello',                         # define column by broadcasting a scalar
                col2=lambda x: x['A']**2,             # define column based on existing columns
                col3=lambda x: x['col2'] / x['A'])    # define column based on previously defined columns
        .astype({'col1': 'category',
                 'col2': 'float'}))
print(df)
print(df.dtypes)
A col1 col2 col3
0 1 hello 1.0 1.0
1 2 hello 4.0 2.0
2 3 hello 9.0 3.0
3 4 hello 16.0 4.0
A int64
col1 category #<-changed dtype
col2 float64 #<-changed dtype
col3 float64
dtype: object
This solution surely solves the first point, not sure about the second:
df['col'] = pd.Categorical(('hello' for i in range(len(df))))
Essentially
we first create a generator of 'hello' with length equal to the number of records in df
then we pass it to pd.Categorical to make it a categorical column.
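If the memory concern is the deciding factor, one way to check it empirically (a sketch, with a hypothetical long string standing in for the real value) is to compare memory_usage(deep=True) of the plain object column against the categorical one:
import pandas as pd

df = pd.DataFrame({'a': range(1_000_000)})
long_value = 'hello, imagine this is a much longer string'   # hypothetical stand-in

df['col'] = long_value                                        # plain object column
print(df['col'].memory_usage(deep=True))

df['col'] = pd.Series(long_value, index=df.index, dtype='category')  # single category
print(df['col'].memory_usage(deep=True))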

Pandas creating new columns based on conditions

I have two dataframes df1 and df2. df1 is a dataframe with various columns and df2 is a dataframe with only one column col2, which is a list of words.
It is obviously wrong, but my code so far is: df1["col_new"] = df1[df1["col1"]].str.contains(df2["col2"])
Basically, I want to create a new column called col_new in df1 that has copied values from col2 in df2 if the values are partial matches to values in col1 in df1.
For example, if col2 = "apple" and col1 = "im.apple3", then I want to copy or assign the value "apple" to col_new and so on.
Another question I have is finding the index/position of the second uppercase letter in a string in col1 in df1.
I found a similar question on here and wrote this code: df["sec_upper"] = df["col1"].apply(lambda x: re.search("[A-Z]+{2}",x).span())[1] but I get an error saying "multiple repeat at position 6".
Can someone please help me out? Thank you in advance!
EDIT2: First problem solved. Can anyone please help me with the second problem?
EDIT1:
Example dataframes:
df1
col1
im.apple3
Cookiemm
Hi_World123
df2
col2
apple
cookie
world
candy
soda
Expected output:
col1 new_col sec_upper
im.apple3 apple NaN
Cookiemm cookie NaN
Hi_World123 world 4
Try this:
df1['new_col'] = df1['col1'].str.lower().str.extract(f"({'|'.join(df2['col2'])})")
Output:
col1 new_col
0 im.apple3 apple
1 Cookiemm cookie
2 Hi_World123 world
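For the second part (position of the second uppercase letter), a possible sketch, assuming re.finditer is acceptable and that the expected output counts positions starting from 1 (Hi_World123 -> 4):
import re
import numpy as np

def second_upper_pos(s):
    # 0-based positions of every uppercase letter in the string
    positions = [m.start() for m in re.finditer(r'[A-Z]', s)]
    # the expected output looks 1-based, so add 1; NaN if there are fewer than two uppercase letters
    return positions[1] + 1 if len(positions) >= 2 else np.nan

df1['sec_upper'] = df1['col1'].apply(second_upper_pos)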

Concatenate value from 3 excel columns in Python

very new to Python here.
I'm trying to concatenate values from 3 columns of an Excel sheet into 1 column.
I have about 300-400 rows to do.
Values are like this
COl1 COL 2 COL3
CNMG 432 EMU
TNMG 332 ESU
...
Output should be
COL3
CNMG432EMU
TNMG332ESU
...
I tried about every pandas tutorial I could find, but nothing seems to work since I have both str and int values.
Thanks in advance
seems like some simple string concatenation should do the trick
df['concat'] = df['COL1'] + df['COL 2'].astype(str) + df['COL3']
If you have ints or floats, you'll need to cast them to strings with .astype(str); you can check which columns need it with a simple print(df.dtypes).
print(df)
COl1 COL2 COL3 concat
0 CNMG 432 EMU CNMG432EMU
1 TNMG 332 ESU TNMG332ESU
df["COL3"]=df["COL1"]+df["COL2"].astype(str)+df["COL3]
You can also do this pretty easily in pylightxl https://pylightxl.readthedocs.io/en/latest/
import pylightxl as xl
db = xl.readxl('excelfile.xlsx')
cat_3_columns = list(zip(db.ws('Sheet1').col(1), db.ws('Sheet1').col(2), db.ws('Sheet1').col(3)))
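The zipped tuples still have to be joined into single strings; a possible continuation of that snippet (the str() cast is there because the middle column holds integers) could be:
concatenated = [''.join(str(value) for value in row) for row in cat_3_columns]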

Reorder DataFrame rows inplace

Is it possible to reorder pandas.DataFrame rows (based on an index) inplace?
I am looking for something like this:
df.reorder(idx, axis=0, inplace=True)
Here idx is not equal to df.index but is of the same type; it contains the same elements, just in another order. The output is df reordered according to the new idx.
I have not found anything in the documentation, and I failed to get reindex_axis to work, which made me hope it was possible because:
A new object is produced unless the new index is equivalent to the
current one and copy=False
I might have misunderstood what equivalent index means in this context.
Try using the reindex function (note that this is not inplace):
>>import pandas as pd
>>df = pd.DataFrame({'col1':[1,2,3],'col2':['test','hi','hello']})
>>df
col1 col2
0 1 test
1 2 hi
2 3 hello
>>df = df.reindex([2,0,1])
>>df
col1 col2
2 3 hello
0 1 test
1 2 hi
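There is no true in-place variant; a common workaround with the same practical effect (a sketch reusing the example above) is to select the rows in the new order and rebind the name:
df = df.loc[[2, 0, 1]]  # same result as reindex, reassigned back to df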

Grouping by many columns in Pandas

I basically have a dataset that looks as follows
Col1 Col2 Col3 Count
A B 1 50
A B 1 50
A C 20 1
A D 17 2
A E 5 70
A E 15 20
Suppose it is called data. I basically do data.groupby(by=['Col1', 'Col2', 'Col3'], as_index=False, sort=False).sum(), which should give me this:
Col1 Col2 Col3 Count
A B 1 100
A C 20 1
A D 17 2
A E 5 70
A E 15 20
However, this returns an empty dataset, which does have the columns I want but no rows. The only caveat is that the by parameter is calculated dynamically rather than fixed (that's because the columns might change, although Count will always be there).
Any ideas on why this could be failing, and how to fix it?
EDIT: Further searching revealed that pandas' groupby removes rows that have NULL at any column. This is a problem for me because every single column might be NULL. Hence, the actual question is: any reasonable way to deal with NULLs and still use groupby?
Would love to be corrected here, but I'm not sure there is a clean way to handle missing data. As you noted, pandas will just exclude rows from groupby that contain NaN values.
You could fill the NaN values with something beyond the range of your data:
data = pd.read_csv("c:/Users/simon/Desktop/data.csv")
data.fillna(-999, inplace=True)
new = data.groupby(by=['Col1', 'Col2', 'Col3'], as_index=False, sort=False).sum()
This is messy because it won't add those values to the correct group for the summation, but there's no real way to group by something that's missing.
Another method might be to fill each column separately with a placeholder value that is appropriate for that variable.
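In newer pandas versions (1.1 and later), groupby can keep NaN keys as their own groups via dropna=False, which avoids the fill-value workaround; a minimal sketch under that assumption:
new = data.groupby(by=['Col1', 'Col2', 'Col3'], as_index=False,
                   sort=False, dropna=False).sum()  # NaN keys form their own groups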
