Convert Outline format in CSV to Two Columns - python

I have data in a CSV file in the following format (one column in a dataframe). This is essentially like an outline in a Word document, where the letters are the main headers and the numbered items are subheaders:
A
1
2
3
B
1
2
C
1
2
3
4
I want to convert this to the following format (two columns in a dataframe):
A 1
A 2
A 3
B 1
B 2
C 1
C 2
C 3
C 4
I'm using pandas read_csv to convert the data into a dataframe, and I'm trying to reformat it through for loops, but I'm having difficulty because the data repeats and gets overwritten. For example, A 3 gets overwritten with C 3 later in the loop (resulting in two instances of C 3 when only one is desired, and losing A 3 altogether). What's the best way to do this?
Apologies for poor formatting, new to the site.

Use:
#if no csv header use names parameter
df = pd.read_csv(file, names=['col'])
#mask the numeric rows and forward fill the letter headers into a new first column
df.insert(0, 'a', df['col'].mask(df['col'].str.isnumeric()).ffill())
#drop the header rows, where both columns hold the same value
df = df[df['a'] != df['col']]
print (df)
a col
1 A 1
2 A 2
3 A 3
5 B 1
6 B 2
8 C 1
9 C 2
10 C 3
11 C 4
Details:
Check isnumeric values:
print (df['col'].str.isnumeric())
0 False
1 True
2 True
3 True
4 False
5 True
6 True
7 False
8 True
9 True
10 True
11 True
Name: col, dtype: bool
Replace True values with NaN by Series.mask and forward fill the missing values:
print (df['col'].mask(df['col'].str.isnumeric()).ffill())
0 A
1 A
2 A
3 A
4 B
5 B
6 B
7 C
8 C
9 C
10 C
11 C
Name: col, dtype: object
Add the new column at the first position by DataFrame.insert:
df.insert(0, 'a', df['col'].mask(df['col'].str.isnumeric()).ffill())
print (df)
a col
0 A A
1 A 1
2 A 2
3 A 3
4 B B
5 B 1
6 B 2
7 C C
8 C 1
9 C 2
10 C 3
11 C 4
And last, remove the rows where both columns hold the same value by boolean indexing.
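For comparison, the loop approach the question attempts can also work if each pair is appended to a list instead of written back into a shared slot. A minimal sketch, assuming the raw values are already in a list called rows (a hypothetical name, using the question's sample data):

import pandas as pd

rows = ['A', '1', '2', '3', 'B', '1', '2', 'C', '1', '2', '3', '4']
pairs = []
current = None
for item in rows:
    if not item.isnumeric():
        current = item                 #remember the latest letter header
    else:
        pairs.append((current, item))  #one row per subheader, nothing overwritten
df = pd.DataFrame(pairs, columns=['a', 'col'])
print (df)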

Related

Drop column with low variance in pandas

I'm trying to drop columns in my pandas dataframe with 0 variance.
I'm sure this has been answered somewhere, but I had a lot of trouble finding a thread on it. I found this thread; however, when I tried the solution on my dataframe, baseline, with the command
baseline_filtered=baseline.loc[:,baseline.std() > 0.0]
I got the error:
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
So, can someone tell me why I'm getting this error or provide an alternative solution?
There are some non-numeric columns, so std removes these columns by default:
baseline = pd.DataFrame({
    'A':list('abcdef'),
    'B':[4,5,4,5,5,4],
    'C':[7,8,9,4,2,3],
    'D':[1,1,1,1,1,1],
    'E':[5,3,6,9,2,4],
    'F':list('aaabbb')
})
#no A, F columns
m = baseline.std() > 0.0
print (m)
B True
C True
D False
E True
dtype: bool
So a possible solution for keeping or removing the string columns is to use Series.reindex:
baseline_filtered = baseline.loc[:, m.reindex(baseline.columns, fill_value=True)]
print (baseline_filtered)
A B C E F
0 a 4 7 5 a
1 b 5 8 3 a
2 c 4 9 6 a
3 d 5 4 9 b
4 e 5 2 2 b
5 f 4 3 4 b
baseline_filtered = baseline.loc[:, m.reindex(baseline.columns, fill_value=False)]
print (baseline_filtered)
B C E
0 4 7 5
1 5 8 3
2 4 9 6
3 5 4 9
4 5 2 2
5 4 3 4
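The fill_value argument decides what happens to the string columns that std silently dropped: True keeps them in the result, False removes them along with the zero-variance columns.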
Another idea is to use DataFrame.nunique, which works with both string and numeric columns:
baseline_filtered=baseline.loc[:,baseline.nunique() > 1]
print (baseline_filtered)
A B C E F
0 a 4 7 5 a
1 b 5 8 3 a
2 c 4 9 6 a
3 d 5 4 9 b
4 e 5 2 2 b
5 f 4 3 4 b
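If you'd rather keep the std-based check but avoid the alignment problem entirely, another sketch (not from the answers above) restricts the test to numeric columns with select_dtypes:

num = baseline.select_dtypes('number')
#drop only the numeric columns whose standard deviation is 0
baseline_filtered = baseline.drop(columns=num.columns[num.std() == 0])
print (baseline_filtered)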

Aggregate data frame rows based on conditions

I have this table
A B C E
1 2 1 3
1 2 4 4
2 7 1 1
3 4 0 2
3 4 8 3
Now, I want to remove duplicates based on columns A and B and at the same time sum up column C. For E, it should take the value from the row where C shows the max value. The desired result table should look like this:
A B C E
1 2 5 4
2 7 1 1
3 4 8 3
I tried this: df.groupby(['A', 'B']).sum()['C'], but my data frame does not change at all. I think I didn't incorporate the E column part properly... Can somebody advise?
Thanks so much!
If rows are duplicates in columns A and B, we can group by those columns.
In [20]: df
Out[20]:
A B C E
0 1 1 5 4
1 1 1 1 1
2 3 3 8 3
In [21]: df.groupby(['A', 'B'])['C'].sum()
Out[21]:
A B
1 1 6
3 3 8
Name: C, dtype: int64
I tried this: df.groupby(['A', 'B']).sum()['C'] but my data frame does not change at all
Yes, that's because pandas doesn't overwrite the initial DataFrame; groupby returns a new object.
In [22]: df
Out[22]:
A B C E
0 1 1 5 4
1 1 1 1 1
2 3 3 8 3
You have to overwrite it explicitly.
In [23]: df = df.groupby(['A', 'B'])['C'].sum()
In [24]: df
Out[24]:
A B
1 1 6
3 3 8
Name: C, dtype: int64
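The question also asks for E to take the value from the row where C is largest, which the grouping above doesn't cover. A sketch using the question's own data: sorting by C first makes 'last' within each group pick the row with the maximum C:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 7, 4, 4],
                   'C': [1, 4, 1, 0, 8],
                   'E': [3, 4, 1, 2, 3]})

#sort by C so that 'last' per group is the row with the maximum C
out = (df.sort_values('C')
         .groupby(['A', 'B'], as_index=False)
         .agg({'C': 'sum', 'E': 'last'}))
print (out)
   A  B  C  E
0  1  2  5  4
1  2  7  1  1
2  3  4  8  3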

Filling missing data in df.loc filtered conditions?

I have the following problem with filling NaN in a filtered df.
Let's take this df :
condition value
0 A 1
1 B 8
2 B np.nan
3 A np.nan
4 C 3
5 C np.nan
6 A 2
7 B 5
8 C 4
9 A np.nan
10 B np.nan
11 C np.nan
How can I fill np.nan with the value from the last value based on condition, so that I get following result?
condition value
0 A 1
1 B 8
2 B 8
3 A 1
4 C 3
5 C 3
6 A 2
7 B 5
8 C 4
9 A 2
10 B 5
11 C 4
I've failed with the following code (ValueError: Cannot index with multidimensional key):
conditions = set(df['condition'].tolist())
for c in conditions:
    filter = df.loc[df['condition'] == c]
    df.loc[filter, 'value'] = df.loc[filter, 'value'].fillna(method='ffill')
THX & BR from Vienna
If your values are actual NaN, you simply need to do a groupby on condition, and then call ffill (which is essentially a wrapper for fillna(method='ffill')):
df.groupby('condition').ffill()
Which returns:
condition value
0 A 1
1 B 8
2 B 8
3 A 1
4 C 3
5 C 3
6 A 2
7 B 5
8 C 4
9 A 2
10 B 5
11 C 4
If your values are strings that say np.nan, as in your example, then replace them first:
import numpy as np

df.replace('np.nan', np.nan, inplace=True)
df.groupby('condition').ffill()
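Note that df.groupby('condition').ffill() returns a new object rather than modifying df; to keep the result, assign it back. A minimal sketch using the question's data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'condition': list('ABBACCABCABC'),
                   'value': [1, 8, np.nan, np.nan, 3, np.nan,
                             2, 5, 4, np.nan, np.nan, np.nan]})

#assign the forward-filled column back so the result is kept
df['value'] = df.groupby('condition')['value'].ffill()
print (df)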

Compare two columns in pandas to make them match

So I have two dataframes consisting of 6 columns each containing numbers. I need to compare 1 column from each dataframe to make sure they match and fix any values in that column that don't match. Columns are already sorted and they match in terms of length. So far I can find the differences in the columns:
df1.loc[(df1['col1'] != df2['col2'])]
then I get the index # where df1 doesn't match df2. Then I go to that same index # in df2 to find out which value in col2 is causing the mismatch, and use this to change the value to the correct one found in df2:
df1.loc[index_number, 'col1'] = new_value
Is there a way I can automatically fix the mismatches without having to manually look up what the correct value should be in df2?
If df2 is the authoritative source, you don't need to check where df1 differs at all:
df1.loc[:, 'column_name'] = df2['column_name']
But if we must check
c = 'column_name'
df1.loc[df1[c] != df2[c], c] = df2[c]
I think you need to compare by eq and then, if you need to fill in the values where the columns don't match, use combine_first:
df1 = pd.DataFrame({'A':[1,2,3],
                    'B':[4,5,6],
                    'C':[7,8,9],
                    'D':[1,6,5],
                    'E':[5,3,6],
                    'F':[1,4,3]})
print (df1)
A B C D E F
0 1 4 7 1 5 1
1 2 5 8 6 3 4
2 3 6 9 5 6 3
df2 = pd.DataFrame({'A':[1,2,1],
                    'B':[4,5,6],
                    'C':[7,8,9],
                    'D':[1,3,5],
                    'E':[5,3,6],
                    'F':[7,4,3]})
print (df2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 1 6 9 5 6 3
If you need to compare one column with the whole DataFrame:
print (df1.eq(df2.A, axis=0))
A B C D E F
0 True False False True False True
1 True False False False False False
2 False False False False False False
print (df1.eq(df1.A, axis=0))
A B C D E F
0 True False False True False True
1 True False False False False False
2 True False False False False True
And if you only need the same for column D:
df1.D = df1.loc[df1.D.eq(df2.D), 'D'].combine_first(df2.D)
print (df1)
A B C D E F
0 1 4 7 1.0 5 1
1 2 5 8 3.0 3 4
2 3 6 9 5.0 6 3
But then it is easier to simply assign column D from df2 to df1:
df1.D = df2.D
print (df1)
A B C D E F
0 1 4 7 1 5 1
1 2 5 8 3 3 4
2 3 6 9 5 6 3
If the indexes are different, it is possible to use values to convert the column to a numpy array, so the assignment ignores index alignment:
df1.D = df2.D.values
print (df1)
A B C D E F
0 1 4 7 1 5 1
1 2 5 8 3 3 4
2 3 6 9 5 6 3
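Tying this back to the question's col1/col2 setup, a short sketch (assuming both frames share the same index) that overwrites only the mismatching positions:

#keep df1's value where it already matches, take df2's value where it doesn't
df1['col1'] = df1['col1'].where(df1['col1'].eq(df2['col2']), df2['col2'])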

Python value difference in dataframe by group key

I have a DataFrame
name value
A 2
A 4
A 5
A 7
A 8
B 3
B 4
B 8
C 1
C 3
C 5
And I want to get the value differences based on each name
like this
name value dif
A 2 0
A 4 2
A 5 1
A 7 2
A 8 1
B 3 0
B 4 1
B 8 4
C 1 0
C 3 2
C 5 2
Can anyone show me the easiest way?
You can use GroupBy.diff to compute the difference between consecutive rows within each group. Optionally, fill the missing values (the first row in every group) with 0 and finally cast them to integers.
df['dif'] = df.groupby('name')['value'].diff().fillna(0).astype(int)
df
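A minimal runnable version using the question's data:

import pandas as pd

df = pd.DataFrame({'name': list('AAAAABBBCCC'),
                   'value': [2, 4, 5, 7, 8, 3, 4, 8, 1, 3, 5]})

#difference to the previous row within each name group;
#the first row of each group has no predecessor, so fill with 0
df['dif'] = df.groupby('name')['value'].diff().fillna(0).astype(int)
print (df)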
