Python pandas - multiple columns to one series

I have an excel spreadsheet with raw data in:
demo-data:
1
2
3
4
5
6
7
8
9
How do I combine all the numbers into one series, so I can start doing math on it? They are all just numbers of the same "kind".

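Given your dataframe as df, df.values.flatten() may help: it returns every value in the frame as a single 1D NumPy array, which you can wrap in pd.Series(...) if you need a Series.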

You can convert your dataframe to a list and iterate through it to extract and put values into a 1D list:
import pandas as pd
df = pd.read_excel("data.xls")
lst = df.to_numpy().tolist()  # list of rows, each row a list of values
result = []
for row in lst:
    for item in row:
        result.append(item)
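If the goal is a pandas Series rather than a Python list, a minimal sketch along the same lines follows. The three-column frame is a made-up stand-in for the spreadsheet (the column names a, b, c are assumptions); with real data it would come from pd.read_excel.

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet: three columns of numbers.
df = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

# Flatten row by row into a 1D array, then wrap it in a Series.
s = pd.Series(df.to_numpy().ravel())

# Flatten column by column instead (Fortran order).
s_by_column = pd.Series(df.to_numpy().ravel(order="F"))

# Ordinary Series math now works on all the numbers at once.
print(s.sum(), s.mean())
```

Which order to use only matters if the position of each number in the result is meaningful; sums and means are the same either way.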

Related

Column of itemIds that are stored as strings in an array. Some arrays contain multiple itemIds and I want to store them in separate rows

I have a column with entries stored as [array, type] and want to convert that into integers. I already manage to convert the itemId strings into integers by:
for i in range(0,len(df)):
    if len(df["itemIds"][i])<2:
        df['n'][i]=df.itemIds[i][0]
    else:
        df['n'][i] = df.itemIds[i]
But now I have the problem that some arrays have multiple string entries and I do not know how to create an extra row for those to store all values separate. I am trying to run this for loop:
for i in range(0,len(df)):
    if len(df.n[i])>1:
        df.loc[-1]=df.iloc[i]
but since the data is quite large it loads forever. Any advice is highly appreciated! Thank you
You could just use explode. For example:
df = pd.DataFrame({
    'itemIds' : [10103923431, 1003052070, 935653934, [10040664250, 10076964903, 10106433820, 5551386]],
    'other' : range(5, 9)
})
#                                              itemIds  other
# 0                                       10103923431      5
# 1                                        1003052070      6
# 2                                         935653934      7
# 3  [10040664250, 10076964903, 10106433820, 5551386]      8
df = df.explode('itemIds', ignore_index=True)
Output:
       itemIds  other
0  10103923431      5
1   1003052070      6
2    935653934      7
3  10040664250      8
4  10076964903      8
5  10106433820      8
6      5551386      8
Note that with explode you can just leave lists of 1 entry as a list rather than converting them into an int value. This will save some computation.
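To illustrate that note, here is a small sketch in which every entry stays a list of ID strings, including the one-element lists, and the conversion to integers happens once after exploding. The sample values and the int64 target type are assumptions for illustration, not taken from the asker's data.

```python
import pandas as pd

# Hypothetical data: every entry is a list of ID strings, some with one item.
df = pd.DataFrame({
    "itemIds": [["10103923431"], ["1003052070"], ["10040664250", "5551386"]],
    "other": [5, 6, 7],
})

# One row per list element; values in the other columns are repeated.
out = df.explode("itemIds", ignore_index=True)

# Convert the exploded strings to integers in one vectorized step.
out["itemIds"] = out["itemIds"].astype("int64")

print(out)
```

This avoids the row-by-row Python loop entirely, which is usually where the time goes on large frames.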

Convert 2 columns of a pandas dataframe to a list

I know that you can pull out a single column from a dataframe to a list by doing this:
newList = df['column1'].tolist()
and that you can convert all values to a list like this:
newList = df.values.tolist()
But is there a way to convert 2 columns from a dataframe to a list so that you get a list that looks like this
   Column 1  Column 2
0     apple         9
1     peach        12
and the resulting list is:
[[apple,9],[peach,12]]
Thanks
As per your example, you can convert a whole pandas DataFrame to a list of lists with df.values.tolist().
If you want only specific columns, select them first and apply the same call to the resulting sub-DataFrame: df[[column1, column2, ..., columnN]].values.tolist()
You can use zip:
[list(i) for i in zip(df['Column 1'], df['Column 2'])]
Output
[[apple,9],[peach,12]]
To convert the entire data frame to a list of lists:
lst = df.to_numpy().tolist()
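Putting the column-selection approach and the zip approach side by side, as a rough sketch. The frame below reuses the question's two columns and adds a third, hypothetical column ("Column 3") only to show that it is left out of the result.

```python
import pandas as pd

# "Column 3" is a hypothetical extra column that should not end up in the list.
df = pd.DataFrame({
    "Column 1": ["apple", "peach"],
    "Column 2": [9, 12],
    "Column 3": [0.5, 0.7],
})

# Select the two columns first, then convert the sub-frame to a list of lists.
pairs = df[["Column 1", "Column 2"]].to_numpy().tolist()

# The zip-based variant builds the same nested list.
pairs_zip = [list(t) for t in zip(df["Column 1"], df["Column 2"])]

print(pairs)
```

Both variants give one inner list per row; the column-selection form scales more naturally when more than two columns are needed.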

Add array of new columns to Pandas dataframe

How do I append a list of integers as new columns to each row in a dataframe in Pandas?
I have a dataframe which I need to append a 20 column sequence of integers as new columns. The use case is that I'm translating natural text in a cell of the row into a sequence of vectors for some NLP with Tensorflow.
But to illustrate, I create a simple data frame to append:
df = pd.DataFrame([(1, 2, 3),(11, 12, 13)])
df.head()
Which generates the output:
    0   1   2
0   1   2   3
1  11  12  13
And then, for each row, I need to pass a function that takes in a particular value in the column '2' and will return an array of integers that need to be appended as columns in the data frame - not as an array in a single cell:
def foo(x):
    return [x+1, x+2, x+3]
Ideally, to run a function like:
df[3, 4, 5] = df['2'].applyAsColumns(foo)
The only solution I can think of is to create the data frame with 3 blank columns [3,4,5] , and then use a for loop to iterate through the blank columns and then input them as values in the loop.
Is this the best way to do it, or is there any functions built into Pandas that would do this? I've tried checking the documentation, but haven't found anything.
Any help is appreciated!
IIUC,
def foo(x):
    return pd.Series([x+1, x+2, x+3])
df = pd.DataFrame([(1, 2, 3),(11, 12, 13)])
df[[3,4,5]] = df[2].apply(foo)
df
Output:
    0   1   2   3   4   5
0   1   2   3   4   5   6
1  11  12  13  14  15  16
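A possible variant, sketched here as an alternative rather than as the answerer's method: keep foo returning a plain list, as in the question, and build the new columns in one step from the list of results. This avoids constructing a pd.Series per row, which tends to matter on large frames. The name new_cols is an assumption for illustration.

```python
import pandas as pd

def foo(x):
    # Same toy function as in the question: returns a plain list.
    return [x + 1, x + 2, x + 3]

df = pd.DataFrame([(1, 2, 3), (11, 12, 13)])

# Apply foo to each value of column 2, then expand the lists into columns.
new_cols = pd.DataFrame(df[2].map(foo).tolist(), index=df.index, columns=[3, 4, 5])

# Attach the new columns next to the existing ones (aligned on the index).
df = df.join(new_cols)

print(df)
```

Passing index=df.index keeps the new columns aligned with the original rows even when the frame does not use a default 0..n-1 index.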

Transpose all rows in one column of dataframe to multiple columns based on certain conditions

I would like to convert one column of data to multiple columns in dataframe based on certain values/conditions.
Please find the code to generate the input dataframe
df1 = pd.DataFrame({'VARIABLE':['studyid',1,'age_interview', 65,'Gender','1.Male',
'2.Female',
'Ethnicity','1.Chinese','2.Indian','3.Malay']})
The data looks like as shown below
Please note that I may not know the column names in advance, but the data usually follows this format. What I have shown above is sample data; the real data might have around 600-700 columns arranged in this fashion.
What I would like to do is convert values which start with a non-digit character into new columns in the dataframe. The result can be a new dataframe.
I attempted to write a for loop, but it failed with an error. Can you please help me achieve this outcome?
for i in range(3,len(df1)):
    #str(df1['VARIABLE'][i].contains('^\d'))
    if (df1['VARIABLE'][i].astype(str).contains('^\d') == True):
Through the above loop, I was trying to check whether the first character is a digit. If yes, retain the entry as a value (ex: 1, 2, 3 etc.); if it is a letter (ex: Gender, Ethnicity etc.), create a new column. But I guess this is an incorrect and lengthy approach.
In the above example, the columns would be studyid, age_interview, Gender, Ethnicity.
The final output would look like this
Can you please let me know if there is an elegant approach to do this?
You can use groupby to do something like:
# astype(bool) keeps the mask boolean so that ~ negates it as expected
m = ~df1['VARIABLE'].str[0].str.isdigit().fillna(True).astype(bool)
new_df = (pd.DataFrame(df1.groupby(m.cumsum()).VARIABLE.apply(list)
                       .values.tolist()).set_index(0).T)
print(new_df.rename_axis(None, axis=1))
  studyid age_interview    Gender  Ethnicity
1       1            65    1.Male  1.Chinese
2    None          None  2.Female   2.Indian
3    None          None      None    3.Malay
Explanation: m is a helper series which marks the rows that start a new group (the column names); its cumulative sum gives each group an id:
print(m.cumsum())
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 4
8 4
9 4
10 4
Then we group this helper series and apply list:
df1.groupby(m.cumsum()).VARIABLE.apply(list)
VARIABLE
1 [studyid, 1]
2 [age_interview, 65]
3 [Gender, 1.Male, 2.Female]
4 [Ethnicity, 1.Chinese, 2.Indian, 3.Malay]
Name: VARIABLE, dtype: object
At this point we have each group as a list with the column name as the first entry.
So we create a dataframe with this and set the first column as index and transpose to get our desired output.
Use itertools.groupby and then construct pd.DataFrame:
import pandas as pd
import itertools
l = ['studyid',1,'age_interview', 65,'Gender','1.Male',
     '2.Female',
     'Ethnicity','1.Chinese','2.Indian','3.Malay']
l = list(map(str, l))
grouped = [list(g) for k, g in itertools.groupby(l, key=lambda x:x[0].isnumeric())]
d = {k[0]: v for k,v in zip(grouped[::2],grouped[1::2])}
pd.DataFrame.from_dict(d, orient='index').T
Output:
Gender studyid age_interview Ethnicity
0 1.Male 1 65 1.Chinese
1 2.Female None None 2.Indian
2 None None None 3.Malay

Occurrence frequency from a list against each row in Pandas dataframe

Let's say I have a list of 6 integers named ‘base’ and a dataframe of 100,000 rows with 6 columns of integers as well.
I need to create an additional column which shows the frequency of occurrences of the values in the list ‘base’ in each row of the dataframe.
The order of the integers in both the list ‘base’ and the dataframe is to be ignored in this case.
The occurrence frequency can have a value ranging from 0 to 6.
0 means that none of the 6 integers in the list ‘base’ matches any of the 6 columns in a row of the dataframe.
Can anyone shed some light on this, please?
You can try this:
import pandas as pd
# create frame with six columns of ints
df = pd.DataFrame({'a':[1,2,3,4,10],
                   'b':[8,5,3,2,11],
                   'c':[3,7,1,8,8],
                   'd':[3,7,1,8,8],
                   'e':[3,1,1,8,8],
                   'f':[7,7,1,8,8]})
# list of ints
base =[1,2,3,4,5,6]
# define function to count membership of list
def base_count(y):
    return sum(True for x in y if x in base)
# apply the function row wise using the axis =1 parameter
df.apply(base_count, axis=1)
outputs:
0 4
1 3
2 6
3 2
4 0
dtype: int64
then assign it to a new column:
df['g'] = df.apply(base_count, axis=1)
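With 100,000 rows, a row-wise apply can be slow. A vectorized sketch of the same count, using DataFrame.isin on the answer's sample data, might look like this. Like base_count above, it counts how many cells in each row hold a value from base, so repeated values in a row are counted each time.

```python
import pandas as pd

# Same sample frame and list as in the answer above.
df = pd.DataFrame({'a': [1, 2, 3, 4, 10],
                   'b': [8, 5, 3, 2, 11],
                   'c': [3, 7, 1, 8, 8],
                   'd': [3, 7, 1, 8, 8],
                   'e': [3, 1, 1, 8, 8],
                   'f': [7, 7, 1, 8, 8]})
base = [1, 2, 3, 4, 5, 6]

# isin gives a boolean frame; summing across columns counts the matches per row.
df['g'] = df.isin(base).sum(axis=1)

print(df['g'].tolist())
```

If the count should instead be the number of distinct values of base that appear in a row, a different computation is needed; that reading is not covered by this sketch.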
