Concatenate values from 3 Excel columns in Python

Very new to Python here.
I'm trying to concatenate the values from 3 columns of an Excel sheet into 1 column.
I have about 300-400 rows to process.
Values are like this:
COL1  COL2  COL3
CNMG  432   EMU
TNMG  332   ESU
...
Output should be
COL3
CNMG432EMU
TNMG332ESU
...
I tried about every tutorial in Pandas I could find, but nothing seems to work since I have str and int values.
Thanks in advance

Seems like some simple string concatenation should do the trick:
df['concat'] = df['COL1'] + df['COL2'].astype(str) + df['COL3']
If some columns are ints or floats, you'll need to cast them to strings with .astype(str); you can check which columns need it with a quick print(df.dtypes).
print(df)
   COL1  COL2 COL3      concat
0  CNMG   432  EMU  CNMG432EMU
1  TNMG   332  ESU  TNMG332ESU
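If you are starting from the Excel file itself, here is a minimal end-to-end sketch, assuming pandas with openpyxl installed; the file and sheet names below are placeholders, not from the question:
import pandas as pd

# read the sheet (file/sheet names are hypothetical; adjust to your workbook)
df = pd.read_excel('parts.xlsx', sheet_name='Sheet1')

# COL2 is numeric, so cast it to string before concatenating
df['concat'] = df['COL1'] + df['COL2'].astype(str) + df['COL3']

# write the result back out
df.to_excel('parts_concatenated.xlsx', index=False)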

df["COL3"]=df["COL1"]+df["COL2"].astype(str)+df["COL3]

You can also do this pretty easily in pylightxl https://pylightxl.readthedocs.io/en/latest/
import pylightxl as xl
db = xl.readxl('excelfile.xlsx')
cat_3_columns = list(zip(db.ws('Sheet1').col(1), db.ws('Sheet1').col(2), db.ws('Sheet1').col(3)))
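The zipped tuples still need to be joined into single strings; a small continuation of the sketch above in plain Python (no extra pylightxl calls assumed), skipping the header row if the sheet has one:
# join each row's values into one string, casting ints to str first
concatenated = ["".join(str(v) for v in row) for row in cat_3_columns[1:]]
print(concatenated)  # e.g. ['CNMG432EMU', 'TNMG332ESU', ...]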

Related

How to replace a string value with the means of a column's groups in the entire dataframe

I have a large dataset with 400 columns and 30,000 rows. The dataset is all numerical, but some columns have weird string values in them (denoted as "#?") instead of being blank. This changes the dtype of the columns that contain "#?" to object (150 columns end up as object dtype).
I need to convert all the columns into float or int dtypes, and then fill the normal NaN values in the data, with means of a column's groups. (e.g: means of X, means of Y in each column)
col1  col2  col3
X       21    32
X      NaN     3
Y      NaN     5
My end goal is to apply this to the entire data:
df.groupby("col1").transform(lambda x: x.fillna(x.mean()))
But I can't apply this for the columns that have "#?" in them, they get dropped.
I tried replacing the #? with a numerical value and then converting all the columns to float dtype, which works, but the replaced values should also be handled by the code above.
I thought about replacing #? with a weird sentinel value like -123.456 so that it doesn't get mixed with actual data points, and then replacing all the -123.456 with the column group means, but the -123.456 would need to be excluded from the mean, and I don't know how that would even work. If I convert it back to NaN, the dtype changes back to object.
I think the best way to go about it would be directly replacing the #? with the column group means.
Any ideas?
edit: I'm so dumb lol
df=df.replace('#?', '').astype(float, errors = 'ignore')
this works.
Use:
print (df)
  col1 col2  col3
0    X   21    32
1    X   #?     3
2    Y  NaN     5
import numpy as np

df = (df.set_index('col1')
        .replace(r'#\?', np.nan, regex=True)
        .astype(float)
        .groupby("col1")
        .transform(lambda x: x.fillna(x.mean())))
print (df)
      col2  col3
col1
X     21.0  32.0
X     21.0   3.0
Y      NaN   5.0

removing specific words from a dataset [duplicate]

I have a pandas data frame, which looks like the following:
col1 col2 col3 ...
field1:index1:value1 field2:index2:value2 field3:index3:value3 ...
field1:index4:value4 field2:index5:value5 field3:index5:value6 ...
The field is of int type, index is of int type and value could be int or float type.
I want to convert this data frame into the following expected output:
col1 col2 col3 ...
index1:value1 index2:value2 index3:value3 ...
index4:value4 index5:value5 index5:value6 ...
I want to remove all the field: prefixes from all the cells. How do I do this?
EDIT: An example of a cell looks like: 1:1:1.0445731675303e-06 and I would like to reduce such strings to 1:1.0445731675303e-06, in all the cells.
Given
>>> df
col1 col2 col3
0 1:index1:value1 2:index2:value2 3:index3:value3
1 1:index4:value4 2:index5:value5 3:index5:value6
you can use
>>> df.apply(lambda s: s.str.replace('^\d+:', '', regex=True))
col1 col2 col3
0 index1:value1 index2:value2 index3:value3
1 index4:value4 index5:value5 index5:value6
The regex r'^\d+:' matches a sequence of digits followed by a colon at the beginning of each string.
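Applied to the example cell from the edit, the same pattern strips only the leading field (a quick check added for illustration, not part of the original answer):
import pandas as pd

s = pd.Series(['1:1:1.0445731675303e-06'])
print(s.str.replace(r'^\d+:', '', regex=True))
# 0    1:1.0445731675303e-06
# dtype: object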
Try this:
df = df.applymap(lambda x: ':'.join(str(x).split(':')[1:]))
print(df)
col1 col2 col3
0 index1:value1 index2:value2 index3:value3
1 index4:value4 index5:value5 index5:value6
Another possible way is to split on everything up to and including the first colon and extract the piece you want with .str[index]:
df.apply(lambda s: s.str.split(r'(^[a-z0-9]+\:(.*))').str[-2])
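If the capture-group trick looks opaque, here is what the split actually returns on a single cell (a small illustration added for clarity):
import pandas as pd

s = pd.Series(['1:index1:value1'])
# re.split keeps capture groups in the result, so the second-to-last element
# is the inner group: everything after the first colon
print(s.str.split(r'(^[a-z0-9]+\:(.*))').tolist())
# [['', '1:index1:value1', 'index1:value1', '']]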
Another possible solution is to run the string processing in a list comprehension and create a new dataframe using the old dataframe's column names:
result = [[":".join(word.split(":")[1:]) for word in ent]
          for ent in df.to_numpy()]
pd.DataFrame(result, columns=df.columns)
col1 col2 col3
0 index1:value1 index2:value2 index3:value3
1 index4:value4 index5:value5 index5:value6
This is faster than running applymap or apply; string processing is usually much faster in plain Python than in Pandas.
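If you want to verify the speed claim on your own data, here is a quick timing sketch (sizes are arbitrary and the numbers will vary by machine; nothing below is from the original answer):
import timeit
import pandas as pd

df = pd.DataFrame([['1:index1:value1'] * 3] * 10_000, columns=['col1', 'col2', 'col3'])

def with_comprehension():
    return pd.DataFrame([[":".join(word.split(":")[1:]) for word in ent]
                         for ent in df.to_numpy()],
                        columns=df.columns)

def with_applymap():
    return df.applymap(lambda x: ':'.join(str(x).split(':')[1:]))

# run each conversion a few times and compare wall-clock time
print(timeit.timeit(with_comprehension, number=10))
print(timeit.timeit(with_applymap, number=10))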

Adding a column with one single categorical value to a pandas dataframe

I have a pandas.DataFrame df and would like to add a new column col with one single value "hello". I would like this column to be of dtype category with the single category "hello". I can do the following.
df["col"] = "hello"
df["col"] = df["col"].astype("catgegory")
Do I really need to write df["col"] three times in order to achieve this?
After the first line I am worried that the intermediate dataframe df might take up a lot of space before the new column is converted to categorical. (The dataframe is rather large with millions of rows and the value "hello" is actually a much longer string.)
Are there any other straightforward, "short and snappy" ways of achieving this while avoiding the above issues?
An alternative solution is
df["col"] = pd.Categorical(itertools.repeat("hello", len(df)))
but it requires itertools and the use of len(df), and I am not sure how memory usage is under the hood.
We can explicitly build the Series of the correct size and type instead of implicitly doing so via __setitem__ then converting:
df['col'] = pd.Series('hello', index=df.index, dtype='category')
Sample Program:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3]})
df['col'] = pd.Series('hello', index=df.index, dtype='category')
print(df)
print(df.dtypes)
print(df['col'].cat.categories)
   a    col
0  1  hello
1  2  hello
2  3  hello
a        int64
col   category
dtype: object
Index(['hello'], dtype='object')
A simple way to do this would be to use df.assign to create your new variable, then change the dtype to category using df.astype along with a dictionary of dtypes for the specific columns.
df = df.assign(col="hello").astype({'col':'category'})
df.dtypes
A int64
col category
dtype: object
That way you don't have to create a series of length equal to the dataframe. You can just broadcast the input string directly, which would be a bit more time and memory efficient.
This approach is quite scalable as you can see. You can assign multiple variables as per your need, some based on complex functions as well. Then set datatypes for them as per requirement.
df = pd.DataFrame({'A':[1,2,3,4]})
df = (df.assign(col1='hello',                       # define column based on series or broadcasting
                col2=lambda x: x['A']**2,           # define column based on existing columns
                col3=lambda x: x['col2']/x['A'])    # define column based on previously defined columns
        .astype({'col1': 'category',
                 'col2': 'float'}))
print(df)
print(df.dtypes)
   A   col1  col2  col3
0  1  hello   1.0   1.0
1  2  hello   4.0   2.0
2  3  hello   9.0   3.0
3  4  hello  16.0   4.0
A        int64
col1  category   # <- changed dtype
col2   float64   # <- changed dtype
col3   float64
dtype: object
This solution surely solves the first point, not sure about the second:
df['col'] = pd.Categorical(('hello' for i in range(len(df))))
Essentially
we first create a generator of 'hello' with length equal to the number of records in df
then we pass it to pd.Categorical to make it a categorical column.
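To address the memory concern in the question, you can compare footprints directly with memory_usage(deep=True); a rough check (the row count below is arbitrary):
import pandas as pd

n = 1_000_000
obj_col = pd.Series('hello', index=range(n))                    # plain object dtype
cat_col = pd.Series('hello', index=range(n), dtype='category')  # one category, small integer codes

print(obj_col.memory_usage(deep=True))  # deep=True counts the string for every row
print(cat_col.memory_usage(deep=True))  # roughly one byte per row plus the single category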

Pandas Data frame group by one column whilst multiplying others

I am using python with pandas imported to manipulate some data from a csv file I have. Just playing around to try and learn something new.
I have the following data frame (posted as an image in the original question).
I would like to group the data by Col1 so that I get the result below: a groupby on Col1 with Col3 and Col4 multiplied together.
I have been watching some youtube videos and reading some similar questions on stack overflow but I am having trouble. So far I have the following which involves creating a new Col to hold the result of Col3 x Col4:
df['Col5'] = df.Col3 * df.Col4
gf = df.groupby(['col1', 'Col5'])
You can use a solution without creating a new column: multiply the columns and aggregate by df['Col1'] with sum. It is syntactic sugar:
gf = (df.Col3 * df.Col4).groupby(df['Col1']).sum().reset_index(name='Col2')
print (gf)
    Col1     Col2
0  12345    38.64
1  23456  2635.10
2  45678   419.88
Another possible solution is to create an index from Col1 with set_index, multiply the columns with prod, and then sum within each index group:
gf = df.set_index('Col1')[['Col3','Col4']].prod(axis=1).groupby(level=0).sum().reset_index(name='Col2')
Almost, but you are grouping by too many columns in the end. Try:
gf = df.groupby('Col1')['Col5'].sum()
Or to get it as a dataframe, rather than Col1 as an index (I'm judging that this is what you want from your image), include as_index=False in your groupby:
gf = df.groupby('Col1', as_index=False)['Col5'].sum()
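The original frames were posted as images, so here is a small hypothetical frame (values made up) to make both approaches runnable end to end:
import pandas as pd

df = pd.DataFrame({'Col1': [12345, 12345, 23456],
                   'Col3': [2.0, 3.0, 10.0],
                   'Col4': [5.0, 6.0, 7.0]})

# approach 1: helper column, then groupby-sum
df['Col5'] = df.Col3 * df.Col4
print(df.groupby('Col1', as_index=False)['Col5'].sum())

# approach 2: multiply and aggregate without the helper column
print((df.Col3 * df.Col4).groupby(df['Col1']).sum().reset_index(name='Col2'))

# both show 12345 -> 28.0 and 23456 -> 70.0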

Changing values in a dataframe column based off a different column (python)

  Col1 Col2
0  APT  UB0
1  AK0  UUP
2  IL2  PB2
3  OIU  U5B
4  K29  AAA
My data frame looks similar to the above data. I'm trying to change the values in Col1 if the corresponding values in Col2 have the letter "B" in it. If the value in Col2 has "B", then I want to add "-B" to the end of the value in Col1.
Ultimately I want Col1 to look like this:
Col1
0 APT-B
1 AK0
2 IL2-B
.. ...
I have an idea of how to approach it... but I'm somewhat confused because I know my code is incorrect. In addition, there are NaN values in my actual data for Col1, which will definitely give an error when I try to do val += "-B", since it's not possible to add a string and a float.
for value in dataframe['Col2']:
    if "Z" in value:
        for val in dataframe['Col1']:
            val += "-B"
Does anyone know how to fix/solve this?
Rather than using a loop, let's use pandas directly:
import pandas as pd
df = pd.DataFrame({'Col1': ['APT', 'AK0', 'IL2', 'OIU', 'K29'], 'Col2': ['UB0', 'UUP', 'PB2', 'U5B', 'AAA']})
df.loc[df.Col2.str.contains('B'), 'Col1'] += '-B'
print(df)
Output:
Col1 Col2
0 APT-B UB0
1 AK0 UUP
2 IL2-B PB2
3 OIU-B U5B
4 K29 AAA
You have too many "for" loops in your code. You just need to iterate over the rows once, and for any row satisfying your condition you make the change.
for idx, row in df.iterrows():
    if 'B' in row['Col2']:
        df.loc[idx, 'Col1'] = str(df.loc[idx, 'Col1']) + '-B'
edit: I used str to convert the previous value in Col1 to a string before appending, since you said you sometimes have non-string values there. If this doesn't work for you, please post your test data and results.
You can use a lambda expression. If 'B' is in Col2, then '-B' get appended to Col1. The end result is assigned back to Col1.
df['Col1'] = df.apply(lambda x: x.Col1 + ('-B' if 'B' in x.Col2 else ''), axis=1)
>>> df
Col1 Col2
0 APT-B UB0
1 AK0 UUP
2 IL2-B PB2
3 OIU-B U5B
4 K29 AAA
