Python: combine 2000+ CSV columns into a single column

I'm trying to combine the values of multiple columns into a single column. Suppose I have a CSV with the following data:
col1,col2,col3,col4
1,2,3,4
6,2,4,6
2,5,6,2
I want it to become a single column with the values concatenated, separated by a blank space:
col1
1 2 3 4
6 2 4 6
2 5 6 2
There are 2000+ columns, so concatenating the columns statically will not do.

I have no idea why you would want such a design, but you can aggregate across axis=1:
df.astype(str).agg(' '.join, axis=1).to_frame('col')
col
0 1 2 3 4
1 6 2 4 6
2 2 5 6 2
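End to end, that could look like this (a sketch; the input and output filenames are assumed):
import pandas as pd

df = pd.read_csv('test.csv')  # the 2000+ column file
out = df.astype(str).agg(' '.join, axis=1).to_frame('col1')
out.to_csv('combined.csv', index=False)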

I would try using pandas. This will find all of the column names and then concatenate the values of each row across all columns, saving the result as a new dataframe.
import pandas as pd
df = pd.read_csv('test.csv')
cols = df.columns
df = df[cols].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
The output for this csv file
c1,c2,c3
1,2,3
4,5,6
7,8,9
is:
Index(['c1', 'c2', 'c3'], dtype='object')
0 1 2 3
1 4 5 6
2 7 8 9
The Index(['c1', 'c2', 'c3'], dtype='object') line is the printout of cols, i.e. all of the column names; the numbered rows below it are the concatenated values.
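If the end goal is the one-column CSV from the question, the resulting Series can be written back out (a sketch; the output filename is assumed):
df.to_frame('col1').to_csv('combined.csv', index=False)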

Setting things up:
import numpy as np
import pandas as pd
# generate a random integer dataframe
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
First case (by hand):
str_df1 = df.iloc[:, 0].apply(str) + " " + df.iloc[:, 1].apply(str) + " " + df.iloc[:, 2].apply(str) + " " + df.iloc[:, 3].apply(str)
Second case (generic):
str2_df = df.iloc[:, 0].apply(str)
for i in range(1, df.shape[1]):
    str2_df += " " + df.iloc[:, i].apply(str)
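A quick optional check that the by-hand and generic versions agree:
assert (str_df1 == str2_df).all()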
Hope I have helped.

Related

Update last column header dynamically - pandas

I'm hoping to update the last column header in a pandas df using the first column's header as a prefix. Using the example below, I want to rename column Z to X_Z.
import pandas as pd
df = pd.DataFrame({
'X' : [1,2,3],
'Y' : [1,2,3],
'Z' : [1,2,3],
})
# Update the last col to include a hard-coded suffix
df.columns.values[-1:] = [str(col) + '_Col' for col in df.columns[-1:]]
# Update all cols to include a consistent suffix
df.columns = [str(col) + '_Col' for col in df.columns]
Please note: I don't want to rename it manually, as below. The column headers will vary, so I don't want to have to check what the first column is and hard-code it. I'm hoping to handle all cases.
df.rename(columns={'Z': 'X_Z'}, inplace=True)
Intended Output:
X Y X_Z
0 1 1 1
1 2 2 2
2 3 3 3
We can do this using list indexing and f-strings:
cols = df.columns
df = df.rename(columns={cols[-1]:f'{cols[0]}_{cols[-1]}'})
X Y X_Z
0 1 1 1
1 2 2 2
2 3 3 3
Or we can adjust our columns list by index, then pass this adjusted list back:
cols = df.columns.tolist()
cols[-1] = f'{cols[0]}_{cols[-1]}'
df.columns = cols
X Y X_Z
0 1 1 1
1 2 2 2
2 3 3 3
Bonus: a weird out-of-the-box list comprehension:
[df.columns[0] + '_' + col if idx+1 == df.shape[1] else col for idx, col in enumerate(df.columns)]
# Out
['X', 'Y', 'X_Z']
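To actually apply it, assign the list back to the columns:
df.columns = [df.columns[0] + '_' + col if idx + 1 == df.shape[1] else col
              for idx, col in enumerate(df.columns)]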

Save pandas pivot_table to include index and columns names

I want to save a pandas pivot table for human reading, but DataFrame.to_csv doesn't include the DataFrame.columns.name. How can I do that?
Example:
For the following pivot table:
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3], [6, 7, 8]])
>>> df.columns = list("ABC")
>>> df.index = list("XY")
>>> df
A B C
X 1 2 3
Y 6 7 8
>>> p = pd.pivot_table(data=df, index="A", columns="B", values="C")
When viewing the pivot table, we have both the index name ("A"), and the columns name ("B").
>>> p
B 2 7
A
1 3.0 NaN
6 NaN 8.0
But when exporting as a csv we lose the columns name:
>>> p.to_csv("temp.csv")
===temp.csv===
A,2,7
1,3.0,
6,,8.0
How can I get some kind of human-readable output format which contains the whole of the pivot table, including the .columns.name ("B")?
Something like this would be fine:
B,2,7
A,,
1,3.0,
6,,8.0
Yes, it is possible by prepending a helper DataFrame, though reading the file back is a bit more complicated:
p1 = pd.concat([pd.DataFrame(columns=p.columns, index=[p.index.name]), p])
p1.to_csv('temp.csv', index_label=p.columns.name)
B,2,7
A,,
1,3.0,
6,,8.0
# set the first column as the index
df = pd.read_csv('temp.csv', index_col=0)
# set the columns and index names
df.columns.name = df.index.name
df.index.name = df.index[0]
# remove the first (helper) row of data
df = df.iloc[1:]
print(df)
B 2 7
A
1 3.0 NaN
6 NaN 8.0
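For convenience, the write and read-back steps above can be wrapped into a pair of small helpers (a sketch built from the same code; the function names are my own):
import pandas as pd

def pivot_to_csv(p, path):
    # prepend a row that holds the index name, then write with the
    # columns name as the index label
    header = pd.DataFrame(columns=p.columns, index=[p.index.name])
    pd.concat([header, p]).to_csv(path, index_label=p.columns.name)

def pivot_from_csv(path):
    df = pd.read_csv(path, index_col=0)
    df.columns.name = df.index.name  # the first header cell was columns.name
    df.index.name = df.index[0]      # the first data row holds the index name
    return df.iloc[1:]               # drop the helper row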

How to rename the rows in a dataframe using pandas read (Python)?

I want to rename rows in my Python program (Spyder 3, Python 3.6). At this point I have something like this:
import pandas as pd
data = pd.read_csv(filepath, delim_whitespace = True, header = None)
Before that I renamed my columns:
data.columns = ['A', 'B', 'C']
which gave me something like this:
A B C
0 1 n 1
1 1 H 0
2 2 He 1
3 3 Be 2
But now, I want to rename rows. I want:
A B C
n 1 n 1
H 1 H 0
He 2 He 1
Be 3 Be 2
How can I do it? The main idea is to rename every row created by pd.read_csv using the data in the B column. I tried something like this:
for rows in data:
    data.rename(index={0: "df.loc(index, 'B')", 1: 'one'})
but it's not working.
Any ideas? Maybe just replace the data frame rows by column B? How?
I think you need set_index with rename_axis:
df1 = df.set_index('B', drop=False).rename_axis(None)
A solution with rename and a dictionary:
df1 = df.rename(dict(zip(df.index, df['B'])))
print(dict(zip(df.index, df['B'])))
{0: 'n', 1: 'H', 2: 'He', 3: 'Be'}
If the index is the default RangeIndex, the solution can be:
df1 = df.rename(dict(enumerate(df['B'])))
print(dict(enumerate(df['B'])))
{0: 'n', 1: 'H', 2: 'He', 3: 'Be'}
Output:
print(df1)
A B C
n 1 n 1
H 1 H 0
He 2 He 1
Be 3 Be 2
EDIT:
If you don't want column B, the solution is to use read_csv with the index_col parameter:
import pandas as pd
from io import StringIO

temp = u"""1 n 1
1 H 0
2 He 1
3 Be 2"""
# after testing, replace StringIO(temp) with 'filename.csv'
# (pd.compat.StringIO was removed from newer pandas; io.StringIO works the same)
df = pd.read_csv(StringIO(temp), delim_whitespace=True, header=None, index_col=[1])
print(df)
0 2
1
n 1 1
H 1 0
He 2 1
Be 3 2
I normally rename the rows in my dataset by following these steps.
import pandas as pd
df = pd.read_csv("zzzz.csv")
# in a dataframe it is hard to change the names of the rows, so
# transpose first: all the rows become columns (note the assignment;
# transpose() does not work in place)
df = df.transpose()
df.columns = ["", "", .....]
# make sure this list has the same length as the columns, i.e. don't skip any names
# once you are done renaming them, transpose back:
df = df.transpose()
# we get our original dataset back with the row names changed
Just put the column names into names when reading:
import pandas as pd
df = pd.read_csv('filename.csv', names=["colname A", "colname B"])

reshape a pandas dataframe

suppose a dataframe like this one:
df = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]], columns = ['A', 'B', 'A1', 'B1'])
I would like to have a dataframe which looks like this (the column pairs stacked, with an id marking which pair each row came from):
    A   B  id
0   1   2   1
1   5   6   1
2   9  10   1
3   3   4   2
4   7   8   2
5  11  12   2
What does not work:
new_rows = int(df.shape[1]/2) * df.shape[0]
new_cols = 2
df.values.reshape(new_rows, new_cols, order='F')
Of course I could loop over the data and build a new list of lists, but there must be a better way. Any ideas?
The pd.wide_to_long function is built almost exactly for this situation, where you have many of the same variable prefixes that end in a different digit suffix. The only difference here is that your first set of variables don't have a suffix, so you will need to rename your columns first.
The only issue with pd.wide_to_long is that it must have an identification variable, i, unlike melt. reset_index is used to create this uniquely identifying column, which is dropped later. I think this might get corrected in the future.
df1 = df.rename(columns={'A':'A1', 'B':'B1', 'A1':'A2', 'B1':'B2'}).reset_index()
pd.wide_to_long(df1, stubnames=['A', 'B'], i='index', j='id')\
.reset_index()[['A', 'B', 'id']]
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
You can use lreshape, and for the id column, numpy.repeat:
import numpy as np

a = [col for col in df.columns if 'A' in col]
b = [col for col in df.columns if 'B' in col]
df1 = pd.lreshape(df, {'A' : a, 'B' : b})
df1['id'] = np.repeat(np.arange(len(df.columns) // 2), len(df.index)) + 1
print(df1)
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
EDIT:
lreshape is currently undocumented, and it is possible it might be removed (along with pd.wide_to_long).
A possible solution is merging all three functions into one - maybe melt - but that is not implemented yet. Maybe in some new version of pandas; then my answer will be updated.
I solved this in 3 steps:
1. Make a new dataframe df2 holding only the data you want to add to the initial dataframe df.
2. Delete that data from df (it will be appended below, and it was used to make df2).
3. Append df2 to df.
Like so:
# step 1: create the new dataframe
df2 = df[['A1', 'B1']]
df2.columns = ['A', 'B']
# step 2: delete that data from the original
df = df.drop(["A1", "B1"], axis=1)
# step 3: append (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df = pd.concat([df, df2], ignore_index=True)
Note that you need to specify ignore_index=True so the appended rows are renumbered continuously rather than keeping their old index.
Your end result should be your original dataframe with the data rearranged like you wanted:
In [16]: df
Out[16]:
A B
0 1 2
1 5 6
2 9 10
3 3 4
4 7 8
5 11 12
Use pd.concat() like so:
#Split into separate tables
df_1 = df[['A', 'B']]
df_2 = df[['A1', 'B1']]
df_2.columns = ['A', 'B'] # Make column names line up
# Add the ID column
df_1 = df_1.assign(id=1)
df_2 = df_2.assign(id=2)
# Concatenate
pd.concat([df_1, df_2])
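Note that as written, pd.concat keeps each piece's original index (0, 1, 2, 0, 1, 2). To get the continuous 0-5 index shown in the other answers, pass ignore_index=True:
pd.concat([df_1, df_2], ignore_index=True)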

How to iterate over DataFrame and generate a new DataFrame

I have a data frame that looks like this:
P  Q  L
1  2  3
2  3
4  5  6,7
The objective is to check whether there is any value in L; if yes, extract the values in the L and P columns:
P  L
1  3
4  6
4  7
Note that there might be more than one value in L; in that case I need one row per value, as for P = 4 above.
Below is my current script; it cannot generate the expected result.
df2 = []
for item in data_DF.iterrows():
    if item[1]["L"] is not None:
        ego = item[1]['P']
        other = item[1]['L']
        newrow = ego + other + "\n"
        df2.append(newrow)
data_DF2 = pd.DataFrame(df2)
First, you can extract all rows of the L and P columns where L is not missing like so:
df2 = df[~pd.isnull(df.L)].loc[:, ['P', 'L']].set_index('P')
Next, you can deal with the multiple values in some of the remaining L rows as follows:
df2 = df2.L.str.split(',', expand=True).stack()
df2 = df2.reset_index().drop('level_1', axis=1).rename(columns={0: 'L'}).dropna()
df2.L = df2.L.str.strip()
To explain: with P as index, the code splits the string content of the L column on ',' and distributes the individual elements across various columns. It then stacks the various new columns into a single new column, and cleans up the result.
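Put together on the sample data, that looks like this (a sketch; the frame is constructed inline rather than read from a file):
import numpy as np
import pandas as pd

df = pd.DataFrame({'P': [1, 2, 4], 'Q': [2, 3, 5], 'L': ['3', np.nan, '6,7']})
df2 = df[~pd.isnull(df.L)].loc[:, ['P', 'L']].set_index('P')
df2 = df2.L.str.split(',', expand=True).stack()
df2 = df2.reset_index().drop('level_1', axis=1).rename(columns={0: 'L'}).dropna()
df2.L = df2.L.str.strip()
print(df2)
#    P  L
# 0  1  3
# 1  4  6
# 2  4  7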
First I extract the multiple values in column L into a new Series s, whose (duplicated) index values come from the original index. Then I remove the unnecessary columns L and Q, join the output back to the original df, and drop the rows with NaN values.
print(df)
P Q L
0 1 2 3
1 2 3 NaN
2 4 5 6,7
s = df['L'].str.split(',', expand=True).stack()
s.index = s.index.droplevel(-1) # to line up with df's index
s.name = 'L'
print(s)
0 3
2 6
2 7
Name: L, dtype: object
df = df.drop(['L', 'Q'], axis=1)
df = df.join(s)
print(df)
P L
0 1 3
1 2 NaN
2 4 6
2 4 7
df = df.dropna().reset_index(drop=True)
print(df)
P L
0 1 3
1 4 6
2 4 7
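For what it's worth, on pandas 0.25 and later the split-and-stack can be done in one step with explode (a modern alternative, not part of the original answer):
df = df.drop('Q', axis=1)
df['L'] = df['L'].str.split(',')
df = df.explode('L').dropna().reset_index(drop=True)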
I was solving a similar issue when I needed to create a new dataframe as a subset of a larger dataframe. Here's how I went about generating the second dataframe:
import pandas as pd

rows = []
for i, row in df1.iterrows():
    if row['company_id'] == 12345 or row['company_id'] == 56789:
        rows.append(row)
# build the new frame in one go (DataFrame.append was removed in pandas 2.0)
df2 = pd.DataFrame(rows, columns=['column1', 'column2'])
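For a simple filter like this, boolean indexing gives the same result without an explicit loop (a sketch; it assumes df1 actually has columns column1 and column2):
df2 = df1[df1['company_id'].isin([12345, 56789])][['column1', 'column2']]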
