Convert table having string column, array column to all string columns - python

I am trying to convert a table containing string columns and array columns into a table with string columns only.
Here is what the current table looks like:
+-----+---------+----------+
|col1 | col2    | col3     |
+-----+---------+----------+
|  1  | [2,3]   | [4,5]    |
|  2  | [6,7,8] | [8,9,10] |
+-----+---------+----------+
How can I get the expected result, like this:
+-----+------+------+
|col1 | col2 | col3 |
+-----+------+------+
|  1  |  2   |  4   |
|  1  |  3   |  5   |
|  2  |  6   |  8   |
|  2  |  7   |  9   |
|  2  |  8   |  10  |
+-----+------+------+

The confusion comes from mixing scalar columns and list columns.
Under the assumption that, for every row, col2 and col3 are of the same length, we can first turn all scalar columns into list columns and then concatenate:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2],
                   'col2': [[2, 3], [6, 7, 8]],
                   'col3': [[4, 5], [8, 9, 10]]})
# First, we turn all columns into list columns
df['col1'] = df['col1'].apply(lambda x: [x]) * df['col2'].apply(len)
# Then we concatenate the lists
df.apply(np.concatenate)
Output:
   col1  col2  col3
0     1     2     4
1     1     3     5
2     2     6     8
3     2     7     9
4     2     8    10

Convert the columns to lists and then to NumPy arrays, and finally build a DataFrame from them:
vals1 = np.array(df.col2.values.tolist())
vals2 = np.array(df.col3.values.tolist())
col1 = np.repeat(df.col1, vals1.shape[1])
df = pd.DataFrame(np.column_stack((col1, vals1.ravel(), vals2.ravel())), columns=df.columns)
print(df)
   col1  col2  col3
0     1     2     4
1     1     3     5
2     2     6     8
3     2     7     9
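On a recent pandas (the multi-column form of explode was added in 1.3, if I recall correctly), DataFrame.explode can expand both list columns in one call, which avoids the manual bookkeeping in either approach above; a minimal sketch:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2],
                   'col2': [[2, 3], [6, 7, 8]],
                   'col3': [[4, 5], [8, 9, 10]]})

# explode both list columns at once; the lists in each row must be of equal length
out = df.explode(['col2', 'col3'], ignore_index=True)
print(out)
The ignore_index=True keyword rebuilds a 0..n-1 row index instead of repeating the original row labels.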

Related

Pandas, remove rows based on equivalence on differents columns between them [duplicate]

I am looking for an efficient and elegant way in Pandas to remove "duplicate" rows in a DataFrame that have exactly the same set of values but in different columns.
I am ideally looking for a vectorized way to do this, as I can already come up with very inefficient approaches using the pandas.DataFrame.iterrows() method.
Say my DataFrame is:
|source|target|
|------|------|
|  1   |  2   |
|  2   |  1   |
|  4   |  3   |
|  2   |  7   |
|  3   |  4   |
I want it to become:
|source|target|
|------|------|
|  1   |  2   |
|  4   |  3   |
|  2   |  7   |
df = df[~pd.DataFrame(np.sort(df.values,axis=1)).duplicated()]
source target
0 1 2
2 4 3
3 2 7
Explanation:
np.sort(df.values, axis=1) sorts the values within each row:
array([[1, 2],
[1, 2],
[3, 4],
[2, 7],
[3, 4]], dtype=int64)
Then we build a DataFrame from it and mark the non-duplicated rows by negating duplicated() with ~:
~pd.DataFrame(np.sort(df.values,axis=1)).duplicated()
0 True
1 False
2 True
3 True
4 False
dtype: bool
Using this as a mask gives the final output:
source target
0 1 2
2 4 3
3 2 7
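Putting the answer together as a self-contained snippet (a sketch; the column names and data are taken from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'source': [1, 2, 4, 2, 3],
                   'target': [2, 1, 3, 7, 4]})

# Sort each row so that (1, 2) and (2, 1) become identical,
# then drop the rows whose sorted version has already been seen.
mask = ~pd.DataFrame(np.sort(df.values, axis=1)).duplicated()
result = df[mask.values]
print(result)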

Pandas Dataframe - record number of rows based on cumulative sum on a column with a condition

In the df below I already have column "A". I'm trying to add another column, "Desired", where the value is the number of subsequent rows needed for the cumulative sum of A to first reach 8 or more.
For example, row 1 of column "Desired" would be 3 because 5+2+3 >= 8; row 2 of column "Desired" would be 4 because 2+3+2+2 >= 8.
Therefore the ideal new df would be like below.
df:
 A  | Desired
----|--------
 8  |   3
 5  |   4
 2  |   4
 3  |   4
 2  |   3
 2  |   2
 1  |   1
 11 |   1
 8  |   NA
 6  |   NA
Use cumsum() and a for loop:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [8, 5, 2, 3, 2, 2, 1, 11, 8, 6]})
cumsum_arr = df['A'].cumsum().values
desired = np.zeros(len(df))
for i in range(len(df)):
    # position of the first later row where the running sum reaches 8
    desired[i] = np.argmax((cumsum_arr[i:] - cumsum_arr[i]) >= 8)
df['Desired'] = desired
# argmax returns 0 when the threshold is never reached, so mark those rows as missing
df['Desired'] = df['Desired'].replace(0, np.nan)
    A  Desired
0   8      3.0
1   5      4.0
2   2      4.0
3   3      4.0
4   2      3.0
5   2      2.0
6   1      1.0
7  11      1.0
8   8      NaN
9   6      NaN
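If A is guaranteed to be non-negative (so the cumulative sum is non-decreasing), the loop can also be replaced by a single np.searchsorted call; a sketch of that idea:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [8, 5, 2, 3, 2, 2, 1, 11, 8, 6]})
cs = df['A'].cumsum().to_numpy()

# for every row i at once, find the first index j with cs[j] >= cs[i] + 8
pos = np.searchsorted(cs, cs + 8)
desired = pos - np.arange(len(cs))

# rows where the threshold is never reached get NaN
df['Desired'] = np.where(pos < len(cs), desired, np.nan)
print(df)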
Using a rolling() window, it can be achieved without any looping.
import io
import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO("""|A|Desired|
|8 |3 |
|5 |4 |
|2 |4 |
|3 |4 |
|2 |3 |
|2 |2 |
|1 |1 |
|11 |1 |
|8 |NA |
|6 |NA |"""), sep="|")
df = df.drop(columns=[c for c in df.columns if "Unnamed" in c])
df["Desired"] = pd.to_numeric(df["Desired"], errors="coerce").astype("Int64")

# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html see example
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=len(df))
df["DesiredCalc"] = (
    df["A"]
    # look only at the rows after the current row
    .shift(-1)
    .rolling(indexer, min_periods=1)
    # if the running sum ever reaches 8, return the 1-based position where it first does, else NaN
    .apply(lambda x: np.where(np.cumsum(x).ge(8).any(),
                              np.argmax(np.cumsum(x).ge(8)) + 1,
                              np.nan))
    .astype("Int64")
)
output
A Desired DesiredCalc
8 3 3
5 4 4
2 4 4
3 4 4
2 3 3
2 2 2
1 1 1
11 1 1
8 <NA> <NA>
6 <NA> <NA>

Operation on pandas data frames

I don't know how to describe my problem in words, so I will just model it.
Problem modeling:
Let's say we have two DataFrames, df1 and df2, with the same columns.
df1
idx | col1 | col2 | col3 | col4
---------------------------------
0 | 1 | -100 | 2 | -100
df2
idx | col1 | col2 | col3 | col4
---------------------------------
0 | 12 | 23 | 34 | 45
Given these two DataFrames we get
df_result
idx | col1 | col2 | col3 | col4
---------------------------------
0 | 1 | 23 | 2 | 45
I.e. we get df1 with every -100 replaced by the corresponding value from df2.
Question: How can I do it without a for-loop? In particular, is there an operation in pandas, or on two lists of the same size, that does what we need?
PS: I can do it with a for loop, but it will be much slower.
You can use this:
df1[df1==-100] = df2
This is how it works step-by-step:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.array([[1,-100,2,-100],[-100,3,-100,-100]]), columns=['col1','col2','col3','col4'])
df1
col1 col2 col3 col4
1 -100 2 -100
-100 3 -100 -100
df2 = pd.DataFrame(np.array([[12,23,34,45],[1,2,3,4]]), columns=['col1','col2','col3','col4'])
df2
col1 col2 col3 col4
12 23 34 45
1 2 3 4
Using boolean indexing you get:
df1==-100
col1 col2 col3 col4
False True False True
True False True True
So where the mask is True, the corresponding value of df2 is assigned:
df1[df1==-100]=df2
df1
col1 col2 col3 col4
1 23 2 45
1 3 3 4
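Equivalently, DataFrame.mask expresses the same substitution without modifying df1 in place; a short sketch:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([[1, -100, 2, -100], [-100, 3, -100, -100]]),
                   columns=['col1', 'col2', 'col3', 'col4'])
df2 = pd.DataFrame(np.array([[12, 23, 34, 45], [1, 2, 3, 4]]),
                   columns=['col1', 'col2', 'col3', 'col4'])

# wherever df1 equals -100, take the value from df2 instead
df_result = df1.mask(df1 == -100, df2)
print(df_result)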

Concatenate DataFrames in ROW

I want to merge two DataFrames in pandas: one has 126720 rows and 3 columns, the other has 1280 rows and 3 columns. The column names of both DataFrames are the same. I tried to merge them by rows, but the result contains NaN values because the numbers of rows in the two DataFrames are not the same. I want them placed under each other, in which case there shouldn't be any NaN. In other words, I want a DataFrame with 128000 rows and 3 columns. Can anyone point me in a direction how to do it?
Here is the code I tried, which led me to NaN:
df = pd.read_csv('.csv')
df1 = pd.read_csv('.csv')
result = pd.concat([df1, df], ignore_index=True)
Now the result is 128000 rows and 4 columns. Here is a screenshot of part of 'df' and 'df1'.
Hint: I have a large dataset, so I cannot screenshot the whole thing!
Showing smaller versions of df and df1 (just an example):
df:
 col1 | col2 | col3
  11  |  13  |  15
  12  |  14  |  16
df1:
 col1 | col2 | col3
   1  |   3  |   6
   2  |   4  |   7
   3  |   5  |   8
What I want after merging:
result:
 col1 | col2 | col3
   1  |   3  |   6
   2  |   4  |   7
   3  |   5  |   8
  11  |  13  |  15
  12  |  14  |  16
I think my problem is that the row index of both DataFrames is the same, which is why I am getting NaN when I merge them. So how can I change the row index before merging?
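For what it's worth, pd.concat with ignore_index=True already takes care of overlapping row indexes, so an extra column full of NaN usually means the column labels don't match exactly (stray whitespace is a common culprit). A hedged sketch of that check, using small made-up frames in place of the CSV files:
import pandas as pd

df = pd.DataFrame({'col1': [11, 12], 'col2': [13, 14], 'col3': [15, 16]})
df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [3, 4, 5], 'col3': [6, 7, 8]})

# make sure the column labels really are identical (e.g. strip stray whitespace)
df.columns = df.columns.str.strip()
df1.columns = df1.columns.str.strip()

# stack df under df1 and rebuild a clean 0..n-1 row index
result = pd.concat([df1, df], ignore_index=True)
print(result)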

Pandas data frame: adding columns based on previous time periods

I am trying to work through a problem in pandas, being more accustomed to R.
I have a data frame df with three columns: person, period, value
df.head() or the top few rows look like:
| person | period | value
0 | P22 | 1 | 0
1 | P23 | 1 | 0
2 | P24 | 1 | 1
3 | P25 | 1 | 0
4 | P26 | 1 | 1
5 | P22 | 2 | 1
Notice the last row records a value for period 2 for person P22.
I would now like to add a new column that provides the value from the previous period. So if for P22 the value in period 1 is 0, then this new column would look like:
| person | period | value | lastperiod
5 | P22 | 2 | 1 | 0
I believe I need to do something like the following command, having loaded pandas:
for p in df.period.unique():
    df['lastperiod'] == [???]
How should this be formulated?
You could groupby person and then apply a shift to the values:
In [11]: g = df.groupby('person')
In [12]: g['value'].apply(lambda s: s.shift())
Out[12]:
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 0
dtype: float64
Adding this as a column:
In [13]: df['lastPeriod'] = g['value'].apply(lambda s: s.shift())
In [14]: df
Out[14]:
person period value lastPeriod
1 P22 1 0 NaN
2 P23 1 0 NaN
3 P24 1 1 NaN
4 P25 1 0 NaN
5 P26 1 1 NaN
6 P22 2 1 0
Here NaN signifies missing data (i.e. there was no entry in the previous period).
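As a side note, on current pandas the apply is not needed: GroupBy.shift does the same thing directly and is usually faster. A small sketch, reusing the column names from the question:
import pandas as pd

df = pd.DataFrame({'person': ['P22', 'P23', 'P24', 'P25', 'P26', 'P22'],
                   'period': [1, 1, 1, 1, 1, 2],
                   'value':  [0, 0, 1, 0, 1, 1]})

# previous value for the same person; NaN where there is no earlier period
df['lastperiod'] = df.groupby('person')['value'].shift()
print(df)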
