How to convert pandas series with numpy array values to dataframe - python

I have a pandas Series whose values are numpy arrays. For simplicity, say:
import numpy as np
import pandas as pd

series = pd.Series([np.array([1, 2, 3, 4]), np.array([5, 6, 7, 8]), np.array([9, 10, 11, 12])],
                   index=['file1', 'file2', 'file3'])
file1 [1, 2, 3, 4]
file2 [5, 6, 7, 8]
file3 [9, 10, 11, 12]
How can I expand it to a dataframe of the form df_concatenated:
       0   1   2   3
file1  1   2   3   4
file2  5   6   7   8
file3  9  10  11  12
A wider version of the same problem: the series is actually obtained from a different dataframe of the form:
DataFrame:
             0   1
file  slide
file1 1      1   2
      2      3   4
file2 1      5   6
      2      7   8
file3 1      9  10
      2     11  12
by grouping on 'file' index with concatenation of columns.
def concat_sublevel(data):
    return np.concatenate(data.values)

series = data.groupby(level=[0]).apply(concat_sublevel)
Maybe somebody sees a better way to get from the dataframe data to df_concatenated.
Caveat: the slide sub-index can have a different number of values for different file values. In such a case I need to repeat one of the rows to get the same dimensions in all resulting rows.

You can try using pandas.DataFrame.from_records:
pd.DataFrame.from_records(series.values, index=series.index)
Out:
       0   1   2   3
file1  1   2   3   4
file2  5   6   7   8
file3  9  10  11  12
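If you are starting from the multi-indexed dataframe, a possible shortcut is to unstack the slide level instead of concatenating inside groupby.apply. A minimal sketch, assuming every file has the same set of slide values (the caveat about unequal slide counts would still need separate padding):
import numpy as np
import pandas as pd

# rebuild the multi-indexed example frame from the question
idx = pd.MultiIndex.from_product([['file1', 'file2', 'file3'], [1, 2]],
                                 names=['file', 'slide'])
data = pd.DataFrame(np.arange(1, 13).reshape(6, 2), index=idx)

# move 'slide' into the columns, then sort so each file's rows are
# laid out in slide order: 1, 2, 3, 4 for file1, and so on
df_concatenated = data.unstack('slide').sort_index(axis=1, level='slide')
df_concatenated.columns = range(df_concatenated.shape[1])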

Related

Shuffle the rows of a dataframe in python based on a column value, such that the rows with the same column value are together?

Here's the dataframe I have:
import pandas as pd

fruits = pd.DataFrame()
fruits['month'] = ['jan','feb','feb','march','jan','april','april','june','march','march','june','april']
fruits['fruit'] = ['apple','orange','pear','orange','apple','pear','cherry','pear','orange','cherry','apple','cherry']
fruits['price'] = [30,20,40,25,30,45,60,45,25,55,37,60]
fruits
The rows in the dataframe should be shuffled, but the rows with the same month should appear together. In other words, the rows should first be shuffled based on the month, and then the rows sharing a month should be reshuffled amongst one another (a 2-level shuffle).
The output dataframe should look something like this:
fruits_new = pd.DataFrame()
fruits_new['month'] = ['april','april','april','feb','feb','jan','jan','march','march','march','june','june']
fruits_new['fruit'] = ['cherry','pear','cherry','pear','orange','apple','apple','orange','orange','cherry','pear','apple']
fruits_new['price'] = [60,45,60,40,20,30,30,25,25,55,45,37]
fruits_new
You can use pandas.DataFrame.sample with frac=1: it takes a random sample of the dataframe's rows, and frac=1 makes it sample all of them, i.e., a full shuffle.
>>> df.sample(frac=1)
SAMPLE RUN:
# Initial dataframe
   0  1  2
0  5  6  A
1  5  8  B
2  6  6  C
3  6  9  D
4  5  8  E
>>> df.sample(frac=1)
# After shuffle
   0  1  2
0  5  6  A
4  5  8  E
1  5  8  B
3  6  9  D
2  6  6  C
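sample(frac=1) on its own shuffles all rows independently. For the 2-level shuffle the question describes, here is a minimal sketch built on top of it (the two_level_shuffle helper is a made-up name, not a pandas API):
import numpy as np
import pandas as pd

def two_level_shuffle(df, key='month'):
    # level 1: shuffle rows within each month group
    within = df.groupby(key, group_keys=False).apply(lambda g: g.sample(frac=1))
    # level 2: shuffle the order of the month groups themselves
    order = np.random.permutation(within[key].unique())
    parts = [within[within[key] == m] for m in order]
    return pd.concat(parts).reset_index(drop=True)

fruits_new = two_level_shuffle(fruits)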

Frequency table from disorganized data using pandas

I have an Excel file with different numbers and I open it using pandas.
When I read and then print the xlsx file, I get something like this:
    5   7   7
0   6  16   5
1  10  12  15
2   1   5   6
3   5   6  18
.   .   .   .
.   .   .   .
n   .   .   n
All I need is to distribute them into different intervals according to their frequencies.
My code is:
import pandas as pd

excel_archive = pd.read_excel("file name")
print(excel_archive)
I think the Excel file has no header, so first add header=None to read_excel and then use DataFrame.stack with Series.value_counts:
excel_archive = pd.read_excel("file name", header=None)
s = excel_archive.stack().value_counts()
print(s)
5     4
6     3
7     2
15    1
12    1
10    1
18    1
1     1
16    1
dtype: int64
Your question is not very clear, but if you just have to count the number of occurrences you can try something like this:
import numpy as np
import pandas as pd

# generate a dataframe
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 4], [7, 8, 9], [1, 5, 2], [7, 9, 9]]),
                  columns=['a', 'b', 'c'])
# flatten the dataframe into a single Series
df_flat = df.stack().reset_index(drop=True)
# count the number of occurrences of each value
df_flat.groupby(df_flat).size()
This is the input:
   a  b  c
0  1  2  3
1  4  5  4
2  7  8  9
3  1  5  2
4  7  9  9
And this is the output:
1    2
2    2
3    1
4    2
5    2
7    2
8    1
9    3
If instead you want to divide the values into predefined intervals, you can use pd.cut together with groupby:
# define intervals
intervals = pd.IntervalIndex.from_arrays([0, 3, 6], [3, 6, 9], closed='right')
# cut and groupby
df_flat.groupby(pd.cut(df_flat, intervals)).size()
and the result would be:
(0, 3]    5
(3, 6]    4
(6, 9]    6
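If equal-width intervals are enough, Series.value_counts can also do the binning in one step; a one-line sketch on the same df_flat:
# three equal-width bins over the flattened values
df_flat.value_counts(bins=3).sort_index()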

How do I convert my 2D numpy array to a pandas dataframe with given categories?

I have an array called 'values' which features 2 columns of mean reaction time data from 10 individuals. The first column refers to data collected for a single individual in condition A, the second for that same individual in condition B:
array([[451.75      , 488.55555556],
       [552.44444444, 590.40740741],
       [629.875     , 637.62962963],
       [454.66666667, 421.88888889],
       [637.16666667, 539.94444444],
       [538.83333333, 516.33333333],
       [463.83333333, 448.83333333],
       [429.2962963 , 497.16666667],
       [524.66666667, 458.83333333]])
I would like to plot these data using seaborn, to display the mean values and connected single values for each individual across the two conditions. What is the simplest way to convert the array 'values' into a 3 column DataFrame, whereby one column features all the values, another features a label distinguishing that value as condition A or condition B, and a final column which provides a number for each individual (i.e., 1-10)? For example, as follows:
Value   Condition  Individual
451.75  A          1
488.56  B          1
552.44  A          2
...etc
You can do that using pd.melt:
pd.DataFrame(values, columns=['A', 'B']).reset_index().melt(id_vars='index')\
  .rename(columns={'index': 'Individual'})
    Individual variable       value
0            0        A  451.750000
1            1        A  552.444444
2            2        A  629.875000
3            3        A  454.666667
4            4        A  637.166667
5            5        A  538.833333
6            6        A  463.833333
7            7        A  429.296296
8            8        A  524.666667
9            0        B  488.555556
10           1        B  590.407407
11           2        B  637.629630
12           3        B  421.888889
13           4        B  539.944444
14           5        B  516.333333
15           6        B  448.833333
16           7        B  497.166667
17           8        B  458.833333
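The seaborn plot the question asks about can then be sketched from this long-form frame (a sketch, assuming seaborn and matplotlib are available; the column names follow the melt step above):
import seaborn as sns
import matplotlib.pyplot as plt

long_df = (pd.DataFrame(values, columns=['A', 'B'])
             .reset_index()
             .melt(id_vars='index')
             .rename(columns={'index': 'Individual'}))

# one light line per individual connecting their two condition values
sns.lineplot(data=long_df, x='variable', y='value',
             units='Individual', estimator=None, color='lightgray')
# condition means drawn on top
sns.pointplot(data=long_df, x='variable', y='value', color='black')
plt.xlabel('Condition')
plt.ylabel('Mean reaction time')
plt.show()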
This should work:
import pandas as pd
import numpy as np

np_array = np.array([[451.75      , 488.55555556],
                     [552.44444444, 590.40740741],
                     [629.875     , 637.62962963],
                     [454.66666667, 421.88888889],
                     [637.16666667, 539.94444444],
                     [538.83333333, 516.33333333],
                     [463.83333333, 448.83333333],
                     [429.2962963 , 497.16666667],
                     [524.66666667, 458.83333333]])
pd_df = pd.DataFrame(np_array, columns=["A", "B"])
num_individuals = len(pd_df.index)
pd_df = pd_df.melt()
pd_df["INDIVIDUAL"] = [i % num_individuals + 1 for i in pd_df.index]
pd_df
   variable       value  INDIVIDUAL
0         A  451.750000           1
1         A  552.444444           2
2         A  629.875000           3
3         A  454.666667           4
4         A  637.166667           5
5         A  538.833333           6
6         A  463.833333           7
7         A  429.296296           8
8         A  524.666667           9
9         B  488.555556           1
10        B  590.407407           2
11        B  637.629630           3
12        B  421.888889           4
13        B  539.944444           5
14        B  516.333333           6
15        B  448.833333           7
16        B  497.166667           8
17        B  458.833333           9

Write multiple dataframes to a single text file without any delimiters

I have 5 different data frames that I'd like to output to a single text file one after the other.
Because of my specific purpose, I do not want a delimiter.
What is the fastest way to do this?
Example:
Below are the 5 dataframes; a space indicates a new column.
1st df AAA 1 2 3 4 5 6
2nd BBB 1 2 3 4 5 6 7 8 9 10
3rd CCC 1 2 3 4 5 6 6 7 12 2 3 3 4 51 2
CCC 1 2 3 4 5 6 6 7 12 2 3 3 4 51 2
4th DDD 1 2 3 4 5 6 2 3 4 5
5th EEE 1 2 3 4 5 6 7 8 9 10 1 2 2
I'd like convert the above to below in a single text file:
AAA123456
BBB12345678910
CCC12345667122334512
CCC12345667122334512
DDD1234562345
EEE12345678910122
Notice that columns are just removed but rows are preserved with new lines.
I've tried googling around, but to_csv seems to require a delimiter. I also came across a few solutions using "with open" and "write", but those seem to require iterating through every row of the dataframe.
Appreciate any ideas!
Thanks,
You can combine the dataframes with pd.concat.
import pandas as pd
df1 = pd.DataFrame({'AAA': [1, 2, 3, 4, 5, 6]}).T
df2 = pd.DataFrame({'BBB': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}).T
df_all = pd.concat([df1, df2])
Output (which you can format as you see necessary):
     0  1  2  3  4  5    6    7    8     9
AAA  1  2  3  4  5  6  NaN  NaN  NaN   NaN
BBB  1  2  3  4  5  6  7.0  8.0  9.0  10.0
Write to CSV [edit: with a creative delimiter]:
df_all.to_csv('df_all.csv', sep = ',')
Import the CSV and remove spaces:
with open('df_all.csv', mode='r') as file_in:
    with open('df_all_no_spaces.txt', mode='w') as file_out:
        text = file_in.read()
        text = text.replace(',', '')
        file_out.write(text)
There's gotta be a more elegant way to do that last bit, but this works. Perhaps for good reason, pandas doesn't support exporting to CSV with no delimiter.
Edit: you can write to CSV with commas and then remove them. :)
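A sketch of that "more elegant way", staying in memory instead of round-tripping through a CSV (it assumes the values are integers padded out with NaN, as in df_all above):
# build one delimiter-free line per row, dropping the NaN padding cells
lines = df_all.apply(
    lambda row: row.name + ''.join(str(int(v)) for v in row.dropna()), axis=1)
with open('df_all_no_spaces.txt', 'w') as f:
    f.write('\n'.join(lines) + '\n')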
To gather several dataframes into a single text file, do:
whole_curpos = ''
# read every dataframe
for df in dataframe_list:
    # gather all the columns into a single string column
    df['whole_text'] = df[col0].astype(str) + df[col1].astype(str) + ... + df[coln].astype(str)
    for row in range(df.shape[0]):
        whole_curpos = whole_curpos + df['whole_text'].iloc[row]
        whole_curpos = whole_curpos + '\n'
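A hedged alternative along the same lines, writing as it goes instead of accumulating one big string (df_list stands for the five dataframes; rows are assumed to contain no NaNs):
import pandas as pd

with open('combined.txt', 'w') as f:
    for df in df_list:
        # cast every cell to str, then glue each row together with no delimiter
        lines = df.astype(str).apply(''.join, axis=1)
        f.write('\n'.join(lines) + '\n')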

Issue with reading partial header CSV using pandas.read_csv

I'm trying to read a csv file using pandas.read_csv when the file's header is not full, i.e., only some columns have names and others are empty.
When reading the data frame using .iloc I only get the columns for which the header has no names.
The reason some columns do not have names is that the number of columns is variable and I did not assign a name to each column.
Here's an example of the code, input file and output:
import pandas
dataframe = pandas.read_csv('filename.csv', sep=",", header=0)
dataframe = dataframe.iloc[::]
dataset = dataframe.values[:, 0:]
input file
A B C
3 5 0 1 2 3
3 5 4 5 6 7
3 5 8 9 10 11
3 5 12 13 14 15
dataset output:
dataset = [[1, 2, 3], [5, 6, 7], [9, 10, 11], [13, 14, 15]]
How can I get the dataframe to use the entire array (without the header)?
I think you need .values to get back a numpy ndarray.
from io import StringIO
import pandas as pd

csv_file = StringIO("""A B C
3 5 0 1 2 3
3 5 4 5 6 7
3 5 8 9 10 11
3 5 12 13 14 15""")
df = pd.read_csv(csv_file, sep=r'\s+', engine='python')
df.values
Output:
array([[ 1,  2,  3],
       [ 5,  6,  7],
       [ 9, 10, 11],
       [13, 14, 15]])
Why not skiprows=1 when you load the csv file?
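Combining both suggestions, a minimal sketch that reads the entire array while discarding the partial header (same sample data as above):
import pandas as pd
from io import StringIO

csv_file = StringIO("""A B C
3 5 0 1 2 3
3 5 4 5 6 7
3 5 8 9 10 11
3 5 12 13 14 15""")

# skip the header line and assign no column names, so every field is data
df = pd.read_csv(csv_file, sep=r'\s+', header=None, skiprows=1)
print(df.values)  # all six columns, e.g. [[ 3  5  0  1  2  3] ...]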
