Is there a way to create either a DataFrame or a Series in Python from an Excel sheet that has multiple rows and columns, and get the output all in one column?
I tried different code for DataFrames and Series, but none did what I expected; the Series version splits the values letter by letter. This is the code that I used:
import pandas as pd
sr = pd.read_excel('eng.xlsx')
s1 = pd.Series(sr, expand=True)  # the failing attempt: pd.Series has no expand argument
print(s1)
Use DataFrame.stack, then remove the resulting MultiIndex with Series.reset_index and drop=True:
s1 = sr.stack().reset_index(drop=True)
Or convert the values to a NumPy array and flatten it with numpy.ravel or ndarray.flatten:
s1 = pd.Series(sr.values.ravel())
s1 = pd.Series(sr.values.flatten())
Sample:
sr = pd.DataFrame({
'A':list('ab'),
'B':list('cd'),
'C':list('ef'),
})
print (sr)
A B C
0 a c e
1 b d f
s1 = sr.stack().reset_index(drop=True)
print (s1)
0 a
1 c
2 e
3 b
4 d
5 f
dtype: object
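One detail worth checking is the order in which the values end up in the single column. A minimal sketch, reusing the sample frame above instead of the Excel file (the names row_major and col_major are only illustrative): ravel() walks the values row by row by default, while order='F' walks them column by column.

```python
import pandas as pd

# Same sample frame as above; in practice this would come from pd.read_excel
sr = pd.DataFrame({
    'A': list('ab'),
    'B': list('cd'),
    'C': list('ef'),
})

# Default (C order): first row left to right, then second row, and so on
row_major = pd.Series(sr.to_numpy().ravel())

# Fortran order: first column top to bottom, then the next column
col_major = pd.Series(sr.to_numpy().ravel(order='F'))

print(row_major.tolist())  # ['a', 'c', 'e', 'b', 'd', 'f']
print(col_major.tolist())  # ['a', 'b', 'c', 'd', 'e', 'f']
```

The row-by-row result matches what stack().reset_index(drop=True) prints for this sample.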
I have a DataFrame df:
0
0 a
1 b
2 c
3 d
4 e
The output should be:
a b c d e
0
1
2
3
4
5
I want the column containing (a, b, c, d, e) to become the header of my DataFrame.
Could anyone help?
If your DataFrame is a pandas DataFrame named df, you can solve it with pandas:
First convert the content of the first column to a list, then create a new DataFrame whose columns are defined by that list.
import pandas as pd
cols = df[0].tolist()  # df[0] is the content of the first column; avoid naming it "list", which shadows the builtin
dfSolved = pd.DataFrame([], columns=cols)
It would help if you provided more details, such as the index and values of the expected output and the operation you want to perform, so that we can give a solution specific to your case.
Here is the solution:
import pandas as pd
import io
import numpy as np
data_string = """ columns_name
0 a
1 b
2 c
3 d
4 e
"""
df = pd.read_csv(io.StringIO(data_string), sep=r'\s+')  # raw string avoids an invalid-escape warning
# Solution
df_result = pd.DataFrame(data=[[np.nan]*5],
columns=df['columns_name'].tolist())
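If the goal is the layout from the question, a header of a to e with empty rows underneath, one possible sketch is to pass both index and columns to the DataFrame constructor. The sample frame and the choice of one empty row per original row are assumptions made for illustration; the names header and df_empty are not from the question.

```python
import pandas as pd

# Sample input: a single column labelled 0 holding the future header values
df = pd.DataFrame({0: list('abcde')})

# Turn the column into a plain list of header names
header = df[0].tolist()

# Build a frame that has only a header and an index; every cell is NaN
df_empty = pd.DataFrame(index=range(len(df)), columns=header)

print(df_empty)
```

The number of rows is controlled entirely by the index argument, so range(6) or any other index could be used instead.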
I'm trying to copy data from different columns to a particular column in the same DataFrame.
Index  col1A  col2A  colB  list  CT  CW  CH
0
1
:
1
b
2
2
3
3d
But before that, I want to check whether those columns (col1A, col2A, colB) exist in the DataFrame, group the ones that are present, and move the grouped data to the relevant columns (CT, CH, etc.), like this:
CH  CW  CT
0
1
1
1
b
b
2
2
2
3
3d
3d
I tried:
col_list1 = ['ColA','ColB','ColC']
test1 = any(i in df.columns for i in col_list1)
if test1:
    df['CH'] = df['Col1A'] + df['Col2A']
    df['CT'] = df['ColB']
This code throws a KeyError. I want it to ignore columns that are not present and add only those that are present.
If I understand correctly, you can use a Python set or Index.isin to find the common columns:
cols = list(set(col_list1) & set(df.columns))
# or
cols = df.columns[df.columns.isin(col_list1)]
df['CH'] = df[cols].sum(axis=1)
Instead of just adding the columns with +, collect the ones that exist into a list and sum them element-wise. Each list entry is one whole column, so the sum runs over axis=0:
df['CH'] = np.sum([df[c] for c in col_list1 if c in df], axis=0)
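A small self-contained sketch of the same idea, assuming numeric data and made-up values (the frame below and the names wanted and present are illustrative, not from the question): only columns that actually exist are summed, so a missing col2A does not raise a KeyError.

```python
import pandas as pd

# Hypothetical frame: col2A is deliberately missing
df = pd.DataFrame({'col1A': [1, 2, 3], 'colB': [10, 20, 30]})

# Columns we would like to combine into CH, whether or not they exist
wanted = ['col1A', 'col2A']

# Keep only the names that are really present, preserving their order
present = [c for c in wanted if c in df.columns]

# Row-wise sum over the columns that exist
df['CH'] = df[present].sum(axis=1)

# Copy a single column only when it is there
if 'colB' in df.columns:
    df['CT'] = df['colB']

print(df)
```

Using a list comprehension rather than a set keeps the column order stable, which matters if the result is later used for anything order-dependent.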
I have the following dataframe:
df = pd.DataFrame([['A', 1],['B', 2],['C', 3]], columns=['index', 'result'])
index  result
A      1
B      2
C      3
I would like to create a new column, for example multiply the column 'result' by two, and I am just curious to know if there is a way to do it in pandas as pyspark does it.
In pyspark:
df = df\
.withColumn("result_multiplied", F.col("result")*2)
I don't like having to write the name of the DataFrame every time I perform an operation, as is done in pandas:
In pandas:
df['result_multiplied'] = df['result']*2
Use DataFrame.assign:
df = df.assign(result_multiplied = df['result']*2)
Or, if the column result is modified earlier in the same chain, a lambda function is necessary so that assign works on the already-processed values of result:
df = df.assign(result_multiplied = lambda x: x['result']*2)
Sample to see the difference: result_multiplied is computed from the original df['result'], while result_multiplied1 uses the column after mul(2):
df = df.mul(2).assign(result_multiplied = df['result']*2,
result_multiplied1 = lambda x: x['result']*2)
print (df)
index result result_multiplied result_multiplied1
0 AA 2 2 4
1 BB 4 4 8
2 CC 6 6 12
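To get closer to the chained pyspark style, several assign calls can be strung together, each with a lambda that receives the intermediate frame. A minimal sketch on the question's data; the column name result_plus_one and the variable out are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([['A', 1], ['B', 2], ['C', 3]], columns=['index', 'result'])

# Each lambda receives the frame as it exists at that point in the chain,
# so the second assign can already see result_multiplied
out = (
    df
    .assign(result_multiplied=lambda d: d['result'] * 2)
    .assign(result_plus_one=lambda d: d['result_multiplied'] + 1)
)

print(out)
```

assign returns a new DataFrame, so df itself is left unchanged unless the result is assigned back to it.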
I'm new to Python. I need to combine many variables from multiple DataFrames.
I wrote this code, but it takes a long time to adapt it for many exercises.
This is the code:
import pandas as pd
df = pd.concat([df1[df1.columns[0]], df2[df1.columns[0]], df1[df1.columns[1]],
df2[df1.columns[1]], df1[df1.columns[2]], df2[df1.columns[2]],
df1[df1.columns[3]], df2[df1.columns[3]], df1[df1.columns[4]],
df2[df1.columns[4]], df1[df1.columns[5]], df2[df1.columns[5]],
df1[df1.columns[6]], df2[df1.columns[6]]], axis=1)
The number of dataframes and columns can be much bigger. Thanks.
It looks like what you're trying to do is: for all of the columns in one dataframe, combine the columns from that dataframe with those from another with the same columns, into a single dataframe with two of every column in the same original order.
In your case:
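from pandas import DataFrame, concat

df1 = DataFrame([['a','b','c'], ['d','e','f']])
df2 = DataFrame([['g','h','i'], ['j','k','l']])
# pair df1[c] with df2[c] for every column c, then flatten the pairs into one list
df = concat([s for ss in [(df1[c], df2[c]) for c in df1.columns] for s in ss], axis=1)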
print(df)
Result:
0 0 1 1 2 2
0 a g b h c i
1 d j e k f l
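Since the number of DataFrames can grow, the same interleaving can be written over a list of frames. This is only a sketch under the assumption that every frame has the same column labels as the first one; df3 and the names frames and interleaved are invented for the example.

```python
import pandas as pd

df1 = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f']])
df2 = pd.DataFrame([['g', 'h', 'i'], ['j', 'k', 'l']])
df3 = pd.DataFrame([['m', 'n', 'o'], ['p', 'q', 'r']])  # hypothetical third frame

frames = [df1, df2, df3]

# Outer loop over columns, inner loop over frames:
# column 0 of every frame first, then column 1 of every frame, and so on
interleaved = pd.concat(
    [frame[col] for col in frames[0].columns for frame in frames],
    axis=1,
)

print(interleaved)
```

Adding another DataFrame then only means appending it to frames.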
I am attempting to add a Series to an empty DataFrame and cannot find an answer either in the docs or in other questions. Since you can append two DataFrames by row or by column, it would seem there must be an "axis marker" missing from a Series. Can anyone explain why this does not work?
import pandas as pd
df1 = pd.DataFrame()
s1 = pd.Series(['a',5,6])
df1 = pd.concat([df1,s1],axis = 1)
#go run some process return s2, s3, sn ...
s2 = pd.Series(['b',8,9])
df1 = pd.concat([df1,s2],axis = 1)
s3 = pd.Series(['c',10,11])
df1 = pd.concat([df1,s3],axis = 1)
If my example above is somehow misleading, perhaps the example from the docs will help.
Quoting "Appending rows to a DataFrame":
While not especially efficient (since a new object must be created), you can append a single row to a DataFrame by passing a Series or dict to append, which returns a new DataFrame as above. End quote.
The example from the docs appends s, which is a row taken from a DataFrame. s1 is a Series, and attempting to append s1 produces an error. My question is: why does appending s1 not work? The assumption behind the question is that a DataFrame must carry axis information for two axes, whereas a Series carries information for only one axis.
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
s = df.xs(3)  # row with index label 3 of the DataFrame
s1 = pd.Series([np.random.randn(4)])  # new Series of equal len
df = df.append(s, ignore_index=True)
Result
0 1
0 a b
1 5 8
2 6 9
Desired
0 1 2
0 a 5 6
1 b 8 9
You were close; just transpose the result from concat:
In [14]: s1
Out[14]:
0 a
1 5
2 6
dtype: object
In [15]: s2
Out[15]:
0 b
1 8
2 9
dtype: object
In [16]: pd.concat([s1, s2], axis=1).T
Out[16]:
0 1 2
0 a 5 6
1 b 8 9
[2 rows x 3 columns]
You also don't need to create the empty DataFrame.
A simpler way is to use the DataFrame constructor to build the frame from a sequence of Series, rather than using concat:
import pandas as pd
s1 = pd.Series(['a',5,6])
s2 = pd.Series(['b',8,9])
pd.DataFrame([s1, s2])
Output:
In [4]: pd.DataFrame([s1, s2])
Out[4]:
0 1 2
0 a 5 6
1 b 8 9
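When the Series arrive one at a time from some process, as in the question, a common pattern is to collect them in a plain list and build the DataFrame once at the end. A minimal sketch, with the three hard-coded value lists standing in for whatever the process returns; the name results is illustrative.

```python
import pandas as pd

results = []

# Stand-in for "go run some process, return s2, s3, sn ..."
for values in (['a', 5, 6], ['b', 8, 9], ['c', 10, 11]):
    results.append(pd.Series(values))

# Each Series becomes one row; its index (0, 1, 2) becomes the column labels
df1 = pd.DataFrame(results)

print(df1)
```

Building the frame once avoids creating a new DataFrame on every iteration, which is the inefficiency the quoted docs passage warns about.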
A method of accomplishing the same objective as appending a Series to a DataFrame is to convert the data for each row to a list of tuples and add that to the DataFrame.
# data as a list holding one row tuple
def get_example(idx):
    row = (idx + 1, idx + 2, chr(idx + 97))
    return [row]

df1 = pd.DataFrame()
for idx in range(4):
    data = get_example(idx)
    # DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
    df1 = pd.concat([df1, pd.DataFrame(data)], ignore_index=True)