Splitting a DataFrame into an Array Using Numpy - python

I have a file called data that looks like this:
Some Text Information (lines 1-6 in file)
1 22 23
2 44 44
3 55 55
4 66 66
5 77 77
What I'm trying to achieve is something like this:
[[ 22.  23.]
 [ 44.  44.]
 [ 55.  55.]
 [ 66.  66.]
 [ 77.  77.]]
The issue I'm having is that the code I'm using doesn't properly split the data from the file. It ends up looking like this:
[ 1 22 23
0 2 44 44
1 3 55 55, Empty DataFrame
Columns: [1 6734 1453]
Index: [], 1 22 23
2 4 44 44
3 5 55 55
4 6 66 66
5 7 77 77
EOF]
Here's the code I'm using:
import numpy as np
import pandas as pd

def loadFile(filename):
    df1 = pd.read_fwf(filename, skiprows=6)
    df1 = np.split(df1, [2, 2])
    print('The data points:\n {}'.format(df1[:5]))
I understand the parameters of the split function; for instance, [2,2] should create two sub-arrays from my dataframe, and my axis is 0. However, why does it not split the array properly?

You can read the file into a pandas DataFrame and access its values attribute. Assuming "Some Text Information" is not the header:
import pandas as pd
df = pd.read_table(filepath, sep='\t', index_col=0, skiprows=6, header=None)
df.values  # gives you the numpy ndarray
This uses the first column as the index. You might need to remove the sep argument so read_table can infer the separator, or experiment with other separators. If the row index still ends up in your data, slice it off to get the desired result. Use something like:
df.iloc[:,1:].values

Do not use read_fwf, let pandas figure out the structure of your table:
df = pd.read_csv("yourfile", skiprows=6, header=None, sep='\s+')
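If you also want to drop the line-number column and end up with the NumPy array from the question, a minimal sketch building on that call (assuming column 0 holds the line numbers, as in the sample file):
import pandas as pd

df = pd.read_csv("yourfile", skiprows=6, header=None, sep=r'\s+')
arr = df.iloc[:, 1:].values  # drop the line-number column, keep the data columns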

To elaborate on ManKind_008's answer:
Your explicit line numbers are the problem. Pandas interprets these as valid data.
Using ManKind_008's solution does properly set the index column, but since your line numbers start at 1 you end up with a DataFrame like:
pd.read_fwf('test.csv', header=None, index_col=0, skiprows=6)

    1   2
0
1  22  23
2  44  44
3  55  55
4  66  66
5  77  77
Instead I suggest you read in all of your data using:
pd.read_fwf('test.csv', header=None, skiprows=6).iloc[:, 1:]

    1   2
0  22  23
1  44  44
2  55  55
3  66  66
4  77  77
This leaves you with what you seem to need. The iloc call drops the first column of data (your line numbers).
From here the df.values command will give you:
array([[22, 23],
       [44, 44],
       [55, 55],
       [66, 66],
       [77, 77]])
If you don't want an np.array, you can convert it to a plain list of lists with the tolist() method (calling list() on it would give you a list of row arrays instead).
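As for why the original np.split(df1, [2, 2]) misbehaves: a list argument gives split points, not a number of sub-arrays, so it yields three pieces: the rows before index 2, the empty slice between index 2 and itself (hence the "Empty DataFrame" in your output), and everything from index 2 onward. A small illustration on a plain array:
import numpy as np

a = np.arange(6)
print(np.split(a, [2, 2]))
# [array([0, 1]), array([], dtype=int64), array([2, 3, 4, 5])]  (dtype may vary by platform)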

Related

Select columns based on a range and single integer using iloc()

Given the data frame below I want to select columns A and D to F.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(5, 6)), columns=list('ABCDEF'))
In R I could say
df[ , c(1, 4:6)]
I want something similar for pandas. With
df.iloc[:, slice(3, 6, 1)]
I get columns D to F. But how can I add column A?
You can use np.r_ to pass combinations of slices and indices. Since you seem to know the labels, you can use get_loc to obtain the iloc positions.
import numpy as np

idx = np.r_[df.columns.get_loc('A'),
            df.columns.get_loc('D'):df.columns.get_loc('F') + 1]
# idx = np.r_[0, 3:6]
df.iloc[:, idx]
    A   D   E   F
0  38  71  62  63
1  60  93  72  94
2  57  33  30  51
3  88  54  21  39
4   0  53  41  20
Another option is np.split
df_split = pd.concat([np.split(df, [1], axis=1)[0], np.split(df, [3], axis=1)[1]], axis=1)
There you don't need to know the column names, just their positions.
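A label-based alternative, if you would rather avoid positional indices altogether: a sketch reusing the df from the question (note that .loc slicing is inclusive of both endpoints):
subset = pd.concat([df[['A']], df.loc[:, 'D':'F']], axis=1)  # column A plus columns D through F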

Defining a function that converts a list of lists into separate lists?

I am trying to write a function that receives a list of lists as input and, using a for-loop, converts it into separate lists.
def convert_to_sep_lists(listoflists):
    for i in range(len(listoflists)):
        newlst = listoflists[i]
This would obviously return the very last list in the list of lists. How can I save every iteration and return all the lists (within that list) separately?
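One direct way to hand the inner lists back separately, sketched under the assumption that a tuple return is acceptable (a Python function can only return a single object, but a tuple unpacks cleanly at the call site):
def convert_to_sep_lists(listoflists):
    # returning a tuple lets the caller unpack into separate names
    return tuple(listoflists)

a, b, c = convert_to_sep_lists([[1, 2], [3, 4], [5, 6]])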
You can try the pandas to_json function:
import pandas as pd
import numpy as np

n = 20
columns_name = list('abcd')
df = pd.DataFrame(data=np.random.randint(1, 100, size=(5, 4)),
                  columns=columns_name)
print(df)
df.sum().to_json("result.json")
The dataframe df content will be:
    a   b   c   d
0  56  91  65  82
1  63  65  50  78
2  46  43  75   3
3  37  96  84  13
4  40  59  61  66
The output file content will be:
{"a":165,"b":230,"c":234,"d":336}

Manipulating a single row DataFrame object to a 6x6 DataFrame

What I want to do is pretty well explained in the title, but for good measure here is my problem:
For the sake of the example let's say that I have a Google Form with 36 questions and I wanted to manipulate that row of answers to a dataframe using Python 3. Problem is that I get an error, but I'm getting ahead of myself. Here is what I've tried:
from flask import Flask, render_template, request
import pandas as pd
import numpy as np

io_table = pd.DataFrame(np.random.random_sample((1, 36)))
fctr_column = pd.DataFrame(np.random.random_sample((6)))
io_table = pd.DataFrame(io_table)  # Convert list to DataFrame
io_t = io_table
factor = fctr_column
test = pd.DataFrame()
for i in range(0, io_table.shape[1] + 1):
    test=io_table.loc[0,i+1:i+6], ignore_index=True
    i = i + 6
print(test)
And, as I mentioned before, I got an error:
File "path/to/temp.py", line 29, in <module>
test=io_table.loc[0,i+1:i+6], ignore_index=True
TypeError: cannot unpack non-iterable bool object
Now, I don't know what to do. Can anyone provide a solution?
EDIT: Expected input and output
Not sure if I got you right, but if you have a DataFrame with 36 values you can reshape it as in the following example:
import pandas as pd
a = range(1, 37)
df = pd.DataFrame(a).T
df.values.reshape((6,6))
#[[ 1 2 3 4 5 6]
# [ 7 8 9 10 11 12]
# [13 14 15 16 17 18]
# [19 20 21 22 23 24]
# [25 26 27 28 29 30]
# [31 32 33 34 35 36]]
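If you need the result as an actual 6x6 DataFrame rather than a NumPy array (as the title suggests), a short sketch:
df_6x6 = pd.DataFrame(df.values.reshape(6, 6))  # wrap the reshaped values back into a DataFrame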

How to get rolling pandas dataframe subsets

I would like to get dataframe subsets in a "rolling" manner.
I tried several things without success; here is an example of what I would like to do. Let's consider this dataframe:
df
   var1  var2
0    43    74
1    44    74
2    45    66
3    46   268
4    47    66
I would like to create a new column with the following function which performs a conditional sum:
def func(x):
    tmp = (x["var1"] * (x["var2"] == 74)).sum()
    return tmp
and calling it like this
df["newvar"] = df.rolling(2, min_periods=1).apply(func)
That would mean the function is applied to each rolling sub-DataFrame as a whole, not to each row or column separately.
It would return
   var1  var2  newvar
0    43    74      43  # 43
1    44    74      87  # 43 * 1 + 44 * 1
2    45    66      44  # 44 * 1 + 45 * 0
3    46   268       0  # 45 * 0 + 46 * 0
4    47    66       0  # 46 * 0 + 47 * 0
Is there a pythonic way to do this?
This is just an example, but the condition (always based on the sub-dataframe values) depends on more than 2 columns.
updated comment
@unutbu posted a great answer to a very similar question here, but it appears that his answer is based on pd.rolling_apply, which passes the index to the function. I'm not sure how to replicate this with the current DataFrame.rolling.apply method.
original answer
It appears that the variable passed to the function by apply is a numpy array of each column (one at a time), not a DataFrame, so unfortunately you do not have access to the other columns.
But what you can do is use some boolean logic to temporarily create a new column based on whether var2 is 74 or not and then use the rolling method.
df['new_var'] = df.var2.eq(74).mul(df.var1).rolling(2, min_periods=1).sum()
   var1  var2  new_var
0    43    74     43.0
1    44    74     87.0
2    45    66     44.0
3    46   268      0.0
4    47    66      0.0
The temporary column is based on the first half of the code above.
df.var2.eq(74).mul(df.var1)
# or equivalently with operators
# (df['var2'] == 74) * df['var1']
0    43
1    44
2     0
3     0
4     0
Finding the type of the variable passed to apply
It's very important to know what is actually being passed to the applied function, and I can't always remember, so when I'm unsure I print the variable along with its type so it's clear what object I'm dealing with. See this example with your original DataFrame.
def foo(x):
    print(x)
    print(type(x))
    return x.sum()

df.rolling(2, min_periods=1).apply(foo)
Output
[ 43.]
<class 'numpy.ndarray'>
[ 43. 44.]
<class 'numpy.ndarray'>
[ 44. 45.]
<class 'numpy.ndarray'>
[ 45. 46.]
<class 'numpy.ndarray'>
[ 46. 47.]
<class 'numpy.ndarray'>
[ 74.]
<class 'numpy.ndarray'>
[ 74. 74.]
<class 'numpy.ndarray'>
[ 74. 66.]
<class 'numpy.ndarray'>
[ 66. 268.]
<class 'numpy.ndarray'>
[ 268. 66.]
<class 'numpy.ndarray'>
The trick is to define a function that has access to your entire dataframe. Then you do a roll on any column and call apply() passing in that function. The function will have access to the window data, which is a subset of the dataframe column. From that subset you can extract the index you should be looking at. (This assumes that your index is strictly increasing. So the usual integer index will work, as well as most time series.) You can use the index to then access the entire dataframe with all the columns.
def dataframe_roll(df):
    def my_fn(window_series):
        # recover the full-width window from the index range of the series pandas hands us
        window_df = df[(df.index >= window_series.index[0]) & (df.index <= window_series.index[-1])]
        # rolling.apply must return a scalar, so aggregate here
        return (window_df["col1"] + window_df["col2"]).sum()
    return my_fn

df["result"] = df["any_col"].rolling(24).apply(dataframe_roll(df), raw=False)
Here's how you get dataframe subsets in a rolling manner:
for df_subset in df.rolling(2):
    print(type(df_subset), '\n', df_subset)
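Iterating a Rolling object this way is supported in pandas 1.1 and later; each df_subset is a DataFrame holding one window. The conditional sum from the question can then be computed per window, a sketch:
df["newvar"] = [(w["var1"] * (w["var2"] == 74)).sum() for w in df.rolling(2, min_periods=1)]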

getting a subset of arrays from a pandas data frame

I have a numpy array named arr with 1154 elements in it.
array([502, 502, 503, ..., 853, 853, 853], dtype=int64)
I have a data frame called df
   team  Count
0   512     11
1   513     21
2   515     18
3   516      8
4   517      4
How do I get the subset of the data frame df that includes only the values from the array arr?
For example:
team         count
arr1_value1     45
arr1_value2     67
To make this question more clear:
I have a numpy array ['45', '55', '65']
I have a data frame as follows:
team  count
  34    156
  45    189
  53     90
  65     99
  23     77
  55     91
I need a new data frame as follows:
team  count
  45    189
  55     91
  65     99
I don't know if it is a typo that your array values look like strings; assuming it is, and they are in fact ints, you can filter your df by calling isin:
In [6]:
a = np.array([45, 55, 65])
df[df.team.isin(a)]

Out[6]:
   team  count
1    45    189
3    65     99
5    55     91
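If the array really does contain strings, as posted, one sketch is to convert it before filtering (assuming df.team holds ints):
a = np.array(['45', '55', '65'])
df[df.team.isin(a.astype(int))]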
You can use the DataFrame.loc method
Using your example (Notice that team is the index):
arr = np.array(['45', '55', '65'])
frame = pd.DataFrame([156, 189, 90, 99, 77, 91], index=['34', '45', '53', '65', '23', '55'])
ans = frame.loc[arr]
This sort of indexing is type sensitive, so if the frame.index is int then make sure your indexing array is also of type int, and not str like in this example.
I am answering the question asked after "To make this question more clear".
As a side note: the first 4 lines could have been provided by you, so I would not have to type them myself, which could also introduce errors/misunderstanding.
The idea is to create a boolean Series and then simply build a new dataframe based on it. I just started with pandas; maybe this can be done more efficiently.
import numpy as np
import pandas as pd

# starting with the df and teams as strings
df = pd.DataFrame(data={'team': [34, 45, 53, 65, 23, 55], 'count': [156, 189, 90, 99, 77, 91]})
teams = np.array(['45', '55', '65'])

# we want the team numbers as ints
teams_int = [int(t) for t in teams]

# mini function to check if a team is to be kept
def filter_teams(x):
    return x in teams_int

# create the boolean mask and keep only those rows of our original df
mask = df['team'].apply(filter_teams)
df_filtered = df[mask]
It returns this dataframe:
   count  team
1    189    45
3     99    65
5     91    55
Note that in this case df_filtered keeps 1, 3, 5 as its index (the indices of the original dataframe). Your question is unclear about this, as the index is not shown to us.
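A more concise equivalent of the mask-and-filter above, as a sketch (it is the same isin idea as in the first answer):
df_filtered = df[df['team'].isin(teams_int)]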
