Pandas: How to join CSV columns with no header? - python

I have CSV data like the following:
1,2,3,4
a,b,c,d
The line 1,2,3,4 is not a CSV header; it is data, and all of the values are strings.
I want to join the columns at list indexes 1 and 2 using pandas, to get the following result (still as strings):
1,23,4
a,bc,d
Plain Python code that does it looks like the following:
lines = [
['1', '2', '3', '4'],
['a', 'b', 'c', 'd'],
]
vals = lines[0]
s = vals[0] + ',' + (vals[1] + vals[2]) + ',' + vals[3] + '\n'
vals = lines[1]
s += vals[0] + ',' + (vals[1] + vals[2]) + ',' + vals[3] + '\n'
print(s)
How would you do this with pandas?

If you want to use pandas, you could create a new column and remove the old ones:
import pandas as pd
lines = [
['1', '2', '3', '4'],
['a', 'b', 'c', 'd'],
]
df = pd.DataFrame(lines)
# Create new column
df['new_col'] = df[1] + df[2]
print(df)
#    0  1  2  3 new_col
# 0  1  2  3  4      23
# 1  a  b  c  d      bc
# Remove old columns if needed
df.drop([1, 2], axis=1, inplace=True)
print(df)
#    0  3 new_col
# 0  1  4      23
# 1  a  d      bc
If you want the columns in a specific order, use something like this:
print(df[[0, 'new_col', 3]])
#    0 new_col  3
# 0  1      23  4
# 1  a      bc  d
But it's better to save headers in the CSV.
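If you control the file, saving a header row makes later reads trivial; here is a minimal sketch reusing lines from above (the file name is hypothetical):
# Hypothetical file name; columns= supplies the headers the raw data lacks.
named = pd.DataFrame(lines, columns=['w', 'x', 'y', 'z'])
named.to_csv('data.csv', index=False)       # header row is written by default
named = pd.read_csv('data.csv', dtype=str)  # and picked up automatically on read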

You can loop over it using a for loop or a list comprehension.
lines = [
['1', '2', '3', '4'],
['a', 'b', 'c', 'd'],
]
vals = [','.join([w, f'{x}{y}', *z]) for w, x, y, *z in lines]
s = '\n'.join(vals)
print(s)
# prints:
1,23,4
a,bc,d
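If the data actually lives in a CSV file, the same transformation works with the stdlib csv module; a sketch with hypothetical file names:
import csv

# Hypothetical file names; merge columns 1 and 2 row by row.
with open('in.csv', newline='') as src, open('out.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow([row[0], row[1] + row[2], *row[3:]])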

You can do something like this.
import pandas as pd
lines = [
['1', '2', '3', '4'],
['a', 'b', 'c', 'd'],
]
df = pd.DataFrame(lines)
df['new_col'] = df.iloc[:, 1] + df.iloc[:, 2]
print(df)
Output:
   0  1  2  3 new_col
0  1  2  3  4      23
1  a  b  c  d      bc
You can then drop the columns you don't want.
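For example:
df = df.drop(columns=[1, 2])
print(df)
#    0  3 new_col
# 0  1  4      23
# 1  a  d      bc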

Since the OP specified pandas, here's a solution that may work. Once the data is in pandas (e.g. loaded with pd.read_csv()), you can simply concatenate text (object) columns with +:
import pandas as pd
lines = [ ['1', '2', '3', '4'],
['a', 'b', 'c', 'd']]
df = pd.DataFrame(lines)
df[1] = df[1]+df[2]
df.drop(columns=2, inplace=True)
df
#    0   1  3
# 0  1  23  4
# 1  a  bc  d
This should give you what you want in a pandas DataFrame.
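For the read step itself, a minimal sketch (the path is hypothetical): header=None yields the integer column labels 0-3 used above, and dtype=str keeps every value as text so + concatenates.
import pandas as pd

# Hypothetical path; the file has no header row.
df = pd.read_csv('data.csv', header=None, dtype=str)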

Related

Fastest way to fill multiple columns by a given condition on other columns pandas

I'm working with a very long dataframe, so I'm looking for the fastest way to fill several columns at once given certain conditions.
So let's say you have this dataframe:
import pandas as pd

data = {
'col_A1':[1,'','',''],
'col_A2':['','','',''],
'col_A3':['','','',''],
'col_B1':['','',1,''],
'col_B2':['','','',''],
'col_B3':['','','',''],
'col_C1':[1,1,'',''],
'col_C2':['','','',''],
'col_C3':['','','',''],
}
df = pd.DataFrame(data)
df
Input:
  col_A1 col_A2 col_A3 col_B1 col_B2 col_B3 col_C1 col_C2 col_C3
0      1                                         1
1                                                1
2                           1
3
We want to find all '1' values in columns A1, B1 and C1, and then fill the matching rows across columns A2/A3, B2/B3 and C2/C3 as well:
Output:
  col_A1 col_A2 col_A3 col_B1 col_B2 col_B3 col_C1 col_C2 col_C3
0      1      2      3                          1      2      3
1                                               1      2      3
2                           1      2      3
3
I am currently iterating over columns A and looking for where A1 == 1 matches and then replacing the values for A2 and A3 in the matching rows, and the same for B, C...
But speed is important, so I'm wondering if I can do this for all columns at once, or in a more vectorized way.
You can use:
# extract letters/numbers from column names
nums = df.columns.str.extract(r'(\d+)$', expand=False)
# ['1', '2', '3', '1', '2', '3', '1', '2', '3']
letters = df.columns.str.extract(r'_(\D)', expand=False)
# ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']
# or in a single line
# letters, nums = df.columns.str.extract(r'(\D)(\d+)$').T.to_numpy()
# compute a mask of values to fill
mask = df.ne('').groupby(letters, axis=1).cummax(axis=1)
# NB. alternatively use df.eq('1')...
# set the values
df2 = mask.mul(nums)
output:
  col_A1 col_A2 col_A3 col_B1 col_B2 col_B3 col_C1 col_C2 col_C3
0      1      2      3                          1      2      3
1                                               1      2      3
2                           1      2      3
3
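If you're on pandas 2.x, where groupby(..., axis=1) is deprecated, the same mask can be built by transposing; a sketch using the same df, letters and nums as above:
# Transpose, group the rows (formerly columns) by letter, take the running
# max within each group, then transpose back.
mask = df.ne('').T.groupby(letters).cummax().T
df2 = mask.mul(nums)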

PANDAS - Rename and combine like columns

I am trying to rename a column and combine that renamed column to others like it. The row indexes will not be the same (i.e. I am not combining 'City' and 'State' from two columns).
import pandas as pd

df = pd.DataFrame({'Col_1': ['A', 'B', 'C'],
                   'Col_2': ['D', 'E', 'F'],
                   'Col_one': ['G', 'H', 'I']})
df.rename(columns={'Col_one' : 'Col_1'}, inplace=True)
# Desired output:
({'Col_1': ['A', 'B', 'C', 'G', 'H', 'I'],
'Col_2': ['D', 'E', 'F', '-', '-', '-'],})
I've tried pd.concat and a few other things, but it fails to combine the columns in a way I'm expecting. Thank you!
This is melt and pivot after you have renamed:
u = df.melt()
out = (u.assign(k=u.groupby("variable").cumcount())
        .pivot(index="k", columns="variable", values="value")
        .fillna('-'))
out = out.rename_axis(index=None, columns=None)
print(out)
  Col_1 Col_2
0     A     D
1     B     E
2     C     F
3     G     -
4     H     -
5     I     -
Using append without modifying the actual dataframe (note that DataFrame.append was removed in pandas 2.0; pd.concat, as in the concat answer below, is the modern replacement):
result = (df[['Col_1', 'Col_2']]
          .append(df[['Col_one']]
                  .rename(columns={'Col_one': 'Col_1'}), ignore_index=True)
          .fillna('-'))
OUTPUT:
  Col_1 Col_2
0     A     D
1     B     E
2     C     F
3     G     -
4     H     -
5     I     -
This might be a slightly longer method than the other answers, but it delivers the required output.
df = pd.DataFrame({'Col_1': ['A', 'B', 'C'],
'Col_2': ['D', 'E', 'F'],
'Col_one':['G', 'H', 'I'],})
# Keep the values we want to move (a Series, despite the name)
TempList = df['Col_one']
# Append existing dataframe with the values from the list
df = df.append(pd.DataFrame({'Col_1':TempList}), ignore_index = True)
# Drop the redundant column
df.drop(columns=['Col_one'], inplace=True)
# Populate NaN with -
df.fillna('-', inplace=True)
Output is
  Col_1 Col_2
0     A     D
1     B     E
2     C     F
3     G     -
4     H     -
5     I     -
Using concat should work.
import pandas as pd
df = pd.DataFrame({'Col_1': ['A', 'B', 'C'],
'Col_2': ['D', 'E', 'F'],
'Col_one':['G', 'H', 'I'],})
df2 = pd.DataFrame()
df2['Col_1'] = pd.concat([df['Col_1'], df['Col_one']], axis = 0)
df2 = df2.reset_index(drop=True)
df2['Col_2'] = df['Col_2']
df2['Col_2'] = df2['Col_2'].fillna('-')
print(df2)
prints
  Col_1 Col_2
0     A     D
1     B     E
2     C     F
3     G     -
4     H     -
5     I     -

How to append a longer list to dataframe

I want to append a longer list to a dataframe, but I get an error: ValueError: Length of values (4) does not match length of index (3).
import pandas as pd
df = pd.DataFrame({'Data': ['1', '2', '3']})
df['Data2'] =['1', '2', '3', '4']
print(df)
How can I fix it?
Use DataFrame.reindex to add new rows up to the maximal length of the new list and the original DataFrame. This works whether the list is longer, the same length, or shorter:
df = pd.DataFrame({'Data': ['1', '2', '3']})
L = ['1', '2', '3', '4']
df = df.reindex(range(max(len(df), len(L))))
df['Data2'] = L
print(df)
  Data Data2
0    1     1
1    2     2
2    3     3
3  NaN     4
If the list is always longer:
df = df.reindex(range(len(L)))
df['Data2'] = L
You can try using pd.concat here; convert your list to a Series first:
l = ['1', '2', '3', '4']
pd.concat([df, pd.Series(l, name='Data2')], axis=1)
  Data Data2
0    1     1
1    2     2
2    3     3
3  NaN     4

Sort or groupby dataframe in python using given string

I have given dataframe
           Id Direction Load Unit
1  CN05059815   LoadFWD  0,0  NaN
2  CN05059815   LoadBWD  0,0  NaN
4  ...
...
and the given list.
list =['CN05059830','CN05059946','CN05060010','CN05060064' ...]
I would like to sort or group the data using the given list.
For example, the new data should follow exactly the list's ordering: rows whose Id is not in the list (such as CN05059815) come first, and then the rows whose Ids are in the list (CN05059830, CN05059946, ...) follow in the list's order, keeping the rest of the data intact.
One way is to use Categorical Data. Here's a minimal example:
import pandas as pd

# sample dataframe
df = pd.DataFrame({'col': ['A', 'B', 'C', 'D', 'E', 'F']})
# required ordering
lst = ['D', 'E', 'A', 'B']
# convert to categorical
df['col'] = df['col'].astype('category')
# set order, adding values not in lst to the front
order = list(set(df['col']) - set(lst)) + lst
# attach ordering information to categorical series
df['col'] = df['col'].cat.reorder_categories(order)
# apply ordering
df = df.sort_values('col')
print(df)
  col
2   C
5   F
3   D
4   E
0   A
1   B
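On pandas 1.1+ the key argument of sort_values gives the same effect without categoricals; a sketch (kind='stable' keeps the original order of the values missing from lst, which sort first via the -1 fill):
# Rank each value by its position in lst; anything not in lst sorts first.
rank = {v: i for i, v in enumerate(lst)}
df = df.sort_values('col', key=lambda s: s.map(rank).fillna(-1), kind='stable')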
Consider the approach in the example below:
df = pd.DataFrame({
'col': ['a', 'b', 'c', 'd', 'e']
})
list_ = ['d', 'b', 'a']
print(df)
Output:
  col
0   a
1   b
2   c
3   d
4   e
Then in order to sort the df with the list and its ordering:
df.reindex(df['col'].apply(lambda x: list_.index(x) if x in list_ else -1)
           .sort_values().index)
Output:
  col
2   c
4   e
3   d
1   b
0   a

Python Pandas lookup and replace df1 value from df2

I have two dataframes, df and df2.
df column FOUR matches df2 column LOOKUP COL.
I need to match df column FOUR against df2 column LOOKUP COL and replace df column FOUR with the corresponding values from df2 column RETURN THIS.
The resulting dataframe could overwrite df, but I have it listed as result below.
NOTE: the index does not match between the two dataframes.
import pandas as pd

df = pd.DataFrame([['a', 'b', 'c', 'd'],
['e', 'f', 'g', 'h'],
['j', 'k', 'l', 'm'],
['x', 'y', 'z', 'w']])
df.columns = ['ONE', 'TWO', 'THREE', 'FOUR']
  ONE TWO THREE FOUR
0   a   b     c    d
1   e   f     g    h
2   j   k     l    m
3   x   y     z    w
df2 = pd.DataFrame([['a', 'b', 'd', '1'],
['e', 'f', 'h', '2'],
['j', 'k', 'm', '3'],
['x', 'y', 'w', '4']])
df2.columns = ['X1', 'Y2', 'LOOKUP COL', 'RETURN THIS']
  X1 Y2 LOOKUP COL RETURN THIS
0  a  b          d           1
1  e  f          h           2
2  j  k          m           3
3  x  y          w           4
RESULTING DF
  ONE TWO THREE FOUR
0   a   b     c    1
1   e   f     g    2
2   j   k     l    3
3   x   y     z    4
You can use Series.map. You'll need to create a dictionary or a Series to use in map. A Series makes more sense here but the index should be LOOKUP COL:
df['FOUR'] = df['FOUR'].map(df2.set_index('LOOKUP COL')['RETURN THIS'])
df
Out:
  ONE TWO THREE FOUR
0   a   b     c    1
1   e   f     g    2
2   j   k     l    3
3   x   y     z    4
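Equivalently, you can build a plain dictionary with zip and map over it:
# Dictionary from lookup values to replacement values.
mapping = dict(zip(df2['LOOKUP COL'], df2['RETURN THIS']))
df['FOUR'] = df['FOUR'].map(mapping)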
df['FOUR'] = [df2.loc[df2['LOOKUP COL'] == i, 'RETURN THIS'].iloc[0] for i in df['FOUR']]
That should be sufficient to do the trick, though there's probably a more pandas-native way to do it.
Basically, it's a list comprehension: for each i in df['FOUR'], we select the df2['RETURN THIS'] value whose LOOKUP COL matches i.
