Reading data from a text file with a variable number of columns - python

I am reading data from a text file in Python using pandas. The text file has no header row (no column names). I want to reshape the data into a readable form. The problem I am facing is that the rows have a variable number of columns.
For example, my text file contains:
1,2,3,4,5,Hello,7,8
1,2,3,4,5,7,8,
1,2,3,4,5,7,8,
1,2,3,4,5,Hello,7,8,
When I create a DataFrame, I want rows such as the second one to contain NaN in the column where the other rows have Hello, because that value is missing from those rows. After assigning column names and rearranging, the DataFrame should look like this:
1,2,3,4,5,Hello,7,8
1,2,3,4,5,"NA",7,8,
1,2,3,4,5,"NA",7,8,
1,2,3,4,5,Hello,7,8,

Answer to the updated question, which is also a generalized solution for such cases:
focus_col_idx = 5 # The column where you want to bring NaN in expected output
last_idx = df.shape[1] - 1
# Fetching the index of rows which have None in last column
idx = df[df[last_idx].isnull()].index
# Shifting the column values for those rows with index idx
df.iloc[idx,focus_col_idx+1:] = df.iloc[idx,focus_col_idx:last_idx].values
# Putting NaN in the focus column for the rows in idx
df.iloc[idx,focus_col_idx] = np.nan
df
+---+----+---+---+---+---+-------+---+-----+
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---+----+---+---+---+---+-------+---+-----+
| 0 | 1 | 2 | 3 | 4 | 5 | Hello | 7 | 8.0 |
| 1 | 1 | 2 | 3 | 4 | 5 | NaN | 7 | 8.0 |
| 2 | 1 | 2 | 3 | 4 | 5 | NaN | 7 | 8.0 |
| 3 | 1 | 2 | 3 | 4 | 5 | Hello | 7 | 8.0 |
+---+----+---+---+---+---+-------+---+-----+
Answer to previous data
Assuming only one column has missing values (say the 2nd column, as in your previous data), here's a quick solution:
df = pd.read_table('SO.txt', sep=',', header=None)
df
+---+---+---+---+---+------+
| | 0 | 1 | 2 | 3 | 4 |
+---+---+---+---+---+------+
| 0 | A | B | C | D | E |
| 1 | A | C | D | E | None |
+---+---+---+---+---+------+
# Fetching the index of rows which have None in last column
idx = df[df[4].isnull()].index
idx
# Int64Index([1], dtype='int64')
# Shifting the column values for those rows with index idx
df.iloc[idx,2:] = df.iloc[idx,1:4].values
df
+---+---+---+---+---+---+
| | 0 | 1 | 2 | 3 | 4 |
+---+---+---+---+---+---+
| 0 | A | B | C | D | E |
| 1 | A | C | C | D | E | # <- Notice the shifting.
+---+---+---+---+---+---+
# Putting NaN in the second column for the rows in idx
df.iloc[idx,1] = np.nan
# Final output
df
+---+---+-----+---+---+---+
| | 0 | 1 | 2 | 3 | 4 |
+---+---+-----+---+---+---+
| 0 | A | B | C | D | E |
| 1 | A | NaN | C | D | E |
+---+---+-----+---+---+---+
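The same shift can be wrapped in a small helper. This is a minimal sketch rather than the answerer's code: the function name shift_short_rows is hypothetical, the file contents are inlined through io.StringIO instead of being read from SO.txt, and it assumes a row is short exactly when its last column is missing.

```python
import io

import numpy as np
import pandas as pd


def shift_short_rows(df, focus_col_idx):
    """Shift rows whose last column is missing one place to the right,
    starting at focus_col_idx, and put NaN in the focus column."""
    # object dtype lets text and numbers move between columns freely
    out = df.astype(object)
    last_idx = out.shape[1] - 1
    # Boolean row mask: True where the last column is missing
    short = out.iloc[:, last_idx].isna().to_numpy()
    # Move the values one column to the right for the short rows
    out.iloc[short, focus_col_idx + 1:] = out.iloc[short, focus_col_idx:last_idx].to_numpy()
    # The focus column is the one that was actually absent
    out.iloc[short, focus_col_idx] = np.nan
    return out


# Inline stand-in for the text file (assumption: comma-separated, no header)
raw = "A,B,C,D,E\nA,C,D,E\n"
df = pd.read_csv(io.StringIO(raw), header=None)
fixed = shift_short_rows(df, focus_col_idx=1)
print(fixed)
```

The helper works on a copy, so the DataFrame as read from the file stays available for comparison.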

Related

Python Dataframe Checking for a value then adding values to another DataFrame

I have a dataframe with a column with either a 1 or 0 in it.
This is the Signal column.
I want to cycle through this dataframe until I get to the first 1, then take the value in the Open column and put it into the Buy column of another DataFrame, Total.
Then, as it continues through the dataframe, when it reaches the first 0 it should take the value in the Open column and put it into the Sold column of the same DataFrame, Total.
I know I need a loop within a loop but I'm not getting very far!
Any pointers/help would be appreciated!
Total = DataFrame()
for i in range(len(df)) :
    if i.Signal == 1 :
        Total['Buy'] = i.Open
    if i.Signal == 0:
        Total['Sold'] = i.Open
I know the code is wrong!
Cheers
Example of DataFrame
df = pd.DataFrame({'Signal': [0,0,1,1,1,1,0,0,1,1,1,1,0,0,0,1,1,1,0,0], 'Open': np.random.rand(20)})
>>> df
| | Signal | Open |
|---:|---------:|----------:|
| 0 | 0 | 0.959061 |
| 1 | 0 | 0.820516 |
| 2 | 1 | 0.0562783 |
| 3 | 1 | 0.612508 |
| 4 | 1 | 0.288703 |
| 5 | 1 | 0.332118 |
| 6 | 0 | 0.949236 |
| 7 | 0 | 0.20909 |
| 8 | 1 | 0.574924 |
| 9 | 1 | 0.170012 |
| 10 | 1 | 0.0255655 |
| 11 | 1 | 0.788829 |
| 12 | 0 | 0.39462 |
| 13 | 0 | 0.493338 |
| 14 | 0 | 0.347471 |
| 15 | 1 | 0.574096 |
| 16 | 1 | 0.286367 |
| 17 | 1 | 0.131327 |
| 18 | 0 | 0.38943 |
| 19 | 0 | 0.592241 |
# get the position of the first 1
first_1 = (df['Signal']==1).idxmax()
# Create a mask with True in the position of the first 1
# and every time a different value appears (0 after a 1, or 1 after a 0)
mask = np.full(len(df), False)
mask[first_1] = True
for i in range(first_1 + 1, len(df)):
    mask[i] = df['Signal'][i] != df['Signal'][i-1]
>>> df[mask]
| | Signal | Open |
|---:|---------:|----------:|
| 2 | 1 | 0.0562783 |
| 6 | 0 | 0.949236 |
| 8 | 1 | 0.574924 |
| 12 | 0 | 0.39462 |
| 15 | 1 | 0.574096 |
| 18 | 0 | 0.38943 |
# Create a new DataFrame: 'Buy' takes the 1st, 3rd, 5th, ... masked 'Open' values
# and 'Sold' takes the 2nd, 4th, 6th, ... masked 'Open' values
open_values = df[mask]['Open'].to_numpy()
total = pd.DataFrame({'Buy': open_values[0::2], 'Sold': open_values[1::2]})
>>> total
| | Buy | Sold |
|---:|----------:|---------:|
| 0 | 0.0562783 | 0.949236 |
| 1 | 0.574924 | 0.39462 |
| 2 | 0.574096 | 0.38943 |
This works under the assumption that the original df ends with 0s rather than 1s, i.e. every run of 1s must be followed by at least one 0.
The assumption makes sense since the objective is to take differences later.
If the last value is 1, it will produce ValueError: All arrays must be of the same length.
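The mask can also be built without an explicit loop. The following is a sketch of that idea, not the answer's code: the names changed, started and switches are made up for illustration, and the Open values are fixed numbers chosen so the result is easy to check by eye.

```python
import pandas as pd

df = pd.DataFrame({
    "Signal": [0, 0, 1, 1, 0, 0, 1, 1, 1, 0],
    "Open": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0],
})

signal = df["Signal"]
# True wherever the signal differs from the previous row
changed = signal.ne(signal.shift())
# True from the first 1 onwards, so leading 0s are ignored
started = signal.eq(1).cumsum().gt(0)
switches = df[changed & started]

# Rows that switch to 1 are buys, rows that switch to 0 are sells
buys = switches.loc[switches["Signal"] == 1, "Open"].reset_index(drop=True)
sells = switches.loc[switches["Signal"] == 0, "Open"].reset_index(drop=True)
total = pd.concat({"Buy": buys, "Sold": sells}, axis=1)
print(total)
```

Because pd.concat aligns on the index, a trailing run of 1s with no closing 0 leaves NaN in the last Sold cell instead of raising an error.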

Use Python Pandas For Loop to create pivot tables for each column in dataframe

df1:
| A | B | C |ID |
|---|---|---|---|
| 1 | 5 | 2 | Y |
| 4 | 6 | 4 | Z |
df2:
| A | B | C |ID |
|---|---|---|---|
| 2 | 1 | 2 | Y |
| 4 | 6 | 4 | Z |
Merged:
| case | A | B | C |ID |
|------|---|---|---|---|
|before| 1 | 5 | 2 | Y |
|before| 4 | 6 | 4 | Z |
|after | 2 | 1 | 2 | Y |
|after | 4 | 6 | 4 | Z |
desired pivot for column A:
|ID |before|after|
|- |------|-----|
| Y | 1 | 2|
| Z | 4 | 4|
I want to use a for loop to create a pivot table for each column in dfs 1 and 2. The rows of these pivots will be the ID, the columns will be 'case'.
I would like to create a new df for each column's pivot table using a for loop.
Later, I will drop the rows in each pivot table where the before and after values are the same (only keeping the rows where they are different).
I hope I've understood you correctly:
df1["case"] = "before"
df2["case"] = "after"
df = pd.concat([df1, df2]).pivot(index="ID", columns="case")
for c in df.columns.get_level_values(0).unique():
    print(df.xs(c, axis=1))
    print("-" * 80)
Prints:
case after before
ID
Y 2 1
Z 4 4
--------------------------------------------------------------------------------
case after before
ID
Y 1 5
Z 6 6
--------------------------------------------------------------------------------
case after before
ID
Y 2 2
Z 4 4
--------------------------------------------------------------------------------
EDIT: To add dataframes into a dictionary:
df1["case"] = "before"
df2["case"] = "after"
df = pd.concat([df1, df2]).pivot(index="ID", columns="case")
all_dataframes = {}
for c in df.columns.get_level_values(0).unique():
    all_dataframes[c] = df.xs(c, axis=1).copy()
for key, dataframe in all_dataframes.items():
    print("Key:", key)
    print("DataFrame:", dataframe)
    print("-" * 80)
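The question also mentions dropping rows where the before and after values match. A possible sketch of that last step, using the question's data, is shown here; the names merged, wide and changed are assumptions for illustration, and assign is used so that df1 and df2 are not modified.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 4], "B": [5, 6], "C": [2, 4], "ID": ["Y", "Z"]})
df2 = pd.DataFrame({"A": [2, 4], "B": [1, 6], "C": [2, 4], "ID": ["Y", "Z"]})

# Tag each frame with its case, stack them, then pivot to (column, case) pairs
merged = pd.concat([df1.assign(case="before"), df2.assign(case="after")])
wide = merged.pivot(index="ID", columns="case")

changed = {}
for col in wide.columns.get_level_values(0).unique():
    # One small table per original column, ordered before -> after
    table = wide[col][["before", "after"]]
    # Keep only the IDs whose value actually changed
    changed[col] = table[table["before"] != table["after"]]

for col, table in changed.items():
    print(col)
    print(table)
```

With this data only ID Y survives for columns A and B, and the table for C is empty.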

Pandas use replace with a dict only if target column is empty

I create a df:
import pandas as pd
i = pd.DataFrame({1:['a','r','g','a'],2:[7,6,5,""]})
That is:
| 1 | 2 |
|-----|-----|
| a | 7 |
| r | 6 |
| g | 5 |
| a | |
The last entry for a contains an empty string on column 2.
With the following mapping dictionary:
mapping_dict= {"a":'X'}
Current result:
i[2]=i[1].replace(mapping_dict)
| 1 | 2 |
|-----|-----|
| a | X |
| r | 6 |
| g | 5 |
| a | X |
Expected result, where only the empty entries in column 2 are mapped:
| 1 | 2 |
|-----|-----|
| a | 7 |
| r | 6 |
| g | 5 |
| a | X |
Try this (it requires import numpy as np):
i[2] = i[2].replace('', np.nan).fillna(i[1].map(mapping_dict))
Alternatively, you can add a filtering condition by replacing i[2] on the left-hand side of your assignment with i.loc[i[2] == "", 2].
Here, .loc selects the rows that have an empty string in column 2, and its second argument, 2, selects the column to update. You can then keep your current .replace() code to get the desired result:
i.loc[i[2] == "", 2] = i[1].replace(mapping_dict)
Result:
print(i)
1 2
0 a 7
1 r 6
2 g 5
3 a X
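A self-contained variant of the .loc approach is sketched below. It swaps in .map() on the filtered rows instead of .replace(), and the variable name empty is just for illustration.

```python
import pandas as pd

i = pd.DataFrame({1: ["a", "r", "g", "a"], 2: [7, 6, 5, ""]})
mapping_dict = {"a": "X"}

# Rows where column 2 holds an empty string
empty = i[2] == ""
# Look up the mapped value only for those rows and write it back
i.loc[empty, 2] = i.loc[empty, 1].map(mapping_dict)
print(i)
```

One difference to keep in mind: .map() yields NaN for keys missing from the dictionary, whereas .replace() leaves such values unchanged.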

How to merge rows with same string, but sum up the rows connected

I have the following DataFrame
| name | number |
|------|--------|
| a | 1 |
| a | 1 |
| a | 1 |
| b | 2 |
| b | 2 |
| b | 2 |
| c | 3 |
| c | 3 |
| c | 3 |
| d | 4 |
| d | 4 |
| d | 4 |
I want to merge all rows that share the same name, with their number values summed and kept alongside the name.
Desired output:
| name | number |
|------|--------|
| a | 3 |
| b | 6 |
| c | 9 |
| d | 12 |
It seems you need groupby and aggregate sum:
df = df.groupby('name', as_index=False)['number'].sum()
#or
#df = df.groupby('name')['number'].sum().reset_index()
Assuming DataFrame is your table name:
SELECT name, SUM(number) AS [number] FROM DataFrame GROUP BY name
Then delete the original rows and insert the result.
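For the pandas approach, a runnable sketch on the question's data might look like this (the variable name summed is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "name": list("aaabbbcccddd"),
    "number": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
})

# as_index=False keeps 'name' as a regular column instead of the index
summed = df.groupby("name", as_index=False)["number"].sum()
print(summed)
```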

Retrieve the index of a cross section of a dataframe in pandas

Let's say we have a multi-index dataframe df like this:
| | | one | two
| A | B |-------|-------
|----|----|-------|-------
| x | p | 1 | 2
| x | q | 3 | 4
| y | p | 5 | 6
| y | q | 7 | 8
Now, I can get the index of cross section of the criteria one > 4 like this:
idx = df[df['one'] > 4].index
And then, using it with .loc:
df.loc[idx]
yields a slice of the frame:
| | | one | two
| A | B |-------|-------
|----|----|-------|-------
| y | p | 5 | 6
| y | q | 7 | 8
Now, I want to do the same but with a cross section on one level of the multi-index. The .xs method is useful for this:
df.xs('p', level='B')
returns:
| | one | two
| A |-------|-------
|----|-------|-------
| x | 1 | 2
| y | 5 | 6
But this dataframe has different index structure and its index is not a slice of the df index.
So, my question is: what should go in the place of idx so that the following expression
df.loc[idx]
yields
| | | one | two
| A | B |-------|-------
|----|----|-------|-------
| x | p | 1 | 2
| y | p | 5 | 6
You need to use the argument drop_level and set it to False to keep the index:
In [9]: df.xs('p', level='B', drop_level=False)
Out[9]:
one two
A B
x p 1 2
y p 5 6
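To get idx itself, the index of that result can be taken and passed back to the original frame. The sketch below rebuilds the question's frame and also shows a boolean mask on the level values as an alternative; the names subset, mask and same_subset are assumptions for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_product([["x", "y"], ["p", "q"]], names=["A", "B"])
df = pd.DataFrame({"one": [1, 3, 5, 7], "two": [2, 4, 6, 8]}, index=index)

# drop_level=False keeps both levels, so the index is a slice of df.index
idx = df.xs("p", level="B", drop_level=False).index
subset = df.loc[idx]

# Alternative: build a boolean mask directly from the level values
mask = df.index.get_level_values("B") == "p"
same_subset = df[mask]
print(subset)
```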
