Pandas dataframe to numpy array

I am very new to Python and have very little experience. I've managed to get some code working by copying and pasting and substituting the data I have, but I've been looking up how to select data from a dataframe but can't make sense of the examples and substitute my own data in.
The overarching goal: (if anyone could actually help me write the entire thing, that would be helpful, but highly unlikely and probably not allowed)
I am trying to use scipy to fit the curve of a temperature change when two chemicals react. There are 40 trials. The model I am hoping to use is a generalized logistic function with six parameters. All I need are the 40 functions, and nothing else. I have no idea how to achieve this, but I will ask another question when I get there.
The current issue:
I had imported 40 .csv files, compiled/shortened the data into 2 sections so that there are 20 trials in 1 file. Now the data has 21 columns and 63 rows. There is a title in the first row for each column, and the first column is a consistent time interval.
However, not every trial runs for the full length; only one of them does. So I've managed to write the following code for a dataframe:
import pandas as pd
df = pd.read_csv("~/Truncated raw data hcl.csv")
print(df)
It prints the table out, but as expected, there are NaNs where there exists no data.
So I would like to know how to arrange it into a workable array with 2 columns, time and one trial, like the (x, y) pairs of a graph, for future work with numpy or scipy, such that the rows with no data are not included.
Part of the .csv file is shown below. I'm too lazy to put it in a code block, sorry. Thank you.
time,1mnaoh trial 1,1mnaoh trial 2,1mnaoh trial 3,1mnaoh trial 4,2mnaoh trial 1,2mnaoh trial 2,2mnaoh trial 3,2mnaoh trial 4,3mnaoh trial 1,3mnaoh trial 2,3mnaoh trial 3,3mnaoh trial 4,4mnaoh trial 1,4mnaoh trial 2,4mnaoh trial 3,4mnaoh trial 4,5mnaoh trial 1,5mnaoh trial 2,5mnaoh trial 3,5mnaoh trial 4
0.0,23.2,23.1,23.1,23.8,23.1,23.1,23.3,22.0,22.8,23.4,23.3,24.0,23.0,23.8,23.8,24.0,23.3,24.3,24.1,24.1
0.5,23.2,23.1,23.1,23.8,23.1,23.1,23.3,22.1,22.8,23.4,23.3,24.0,23.0,23.8,23.8,24.0,23.4,24.3,24.1,24.1
1.0,23.2,23.1,23.1,23.7,23.1,23.1,23.3,22.3,22.8,23.4,23.3,24.0,23.0,23.8,23.8,24.0,23.5,24.3,24.1,24.1
1.5,23.2,23.1,23.1,23.7,23.1,23.1,23.3,22.4,22.8,23.4,23.3,24.0,23.0,23.8,23.8,23.9,23.6,24.3,24.1,24.1
2.0,23.3,23.2,23.2,24.2,23.6,23.2,24.3,22.5,23.0,23.7,24.4,24.1,23.1,23.9,24.4,24.2,23.7,24.5,24.7,25.1
2.5,24.0,23.5,23.5,25.4,25.3,23.3,26.4,22.7,23.5,25.8,27.9,25.1,23.1,23.9,27.4,26.8,23.8,27.2,26.7,28.1
3.0,25.4,24.4,24.1,26.5,27.8,23.3,28.5,22.8,24.6,28.6,31.2,27.2,23.2,23.9,30.9,30.5,23.9,31.4,29.8,31.3
3.5,26.9,25.5,25.1,27.4,29.9,23.4,30.1,22.9,26.4,31.4,34.0,30.0,23.3,24.2,33.8,34.0,23.9,35.1,33.2,34.4
4.0,27.8,26.5,26.2,27.9,31.4,23.4,31.3,23.1,28.8,34.0,36.1,32.6,23.3,26.6,36.0,36.7,24.0,37.7,35.9,36.8
4.5,28.5,27.3,27.0,28.2,32.6,23.5,32.3,23.1,31.2,36.0,37.5,34.8,23.4,30.0,37.7,38.7,24.0,39.7,38.0,38.7
5.0,28.9,27.9,27.7,28.5,33.4,23.5,33.1,23.2,33.2,37.6,38.6,36.5,23.4,33.2,39.0,40.2,24.0,40.9,39.6,40.2
5.5,29.2,28.2,28.3,28.9,34.0,23.5,33.7,23.3,35.0,38.7,39.4,37.9,23.5,35.6,39.9,41.2,24.0,41.9,40.7,41.0
6.0,29.4,28.5,28.6,29.1,34.4,24.9,34.2,23.3,36.4,39.6,40.0,38.9,23.5,37.3,40.6,42.0,24.1,42.5,41.6,41.2
6.5,29.5,28.8,28.9,29.3,34.7,27.0,34.6,23.3,37.6,40.4,40.4,39.7,23.5,38.7,41.1,42.5,24.1,43.1,42.3,41.7
7.0,29.6,29.0,29.1,29.5,34.9,28.8,34.8,23.5,38.6,40.9,40.8,40.2,23.5,39.7,41.4,42.9,24.1,43.4,42.8,42.3
7.5,29.7,29.2,29.2,29.6,35.1,30.5,35.0,24.9,39.3,41.4,41.1,40.6,23.6,40.5,41.7,43.2,24.0,43.7,43.1,42.9
8.0,29.8,29.3,29.3,29.7,35.2,31.8,35.2,26.9,40.0,41.6,41.3,40.9,23.6,41.1,42.0,43.4,24.2,43.8,43.3,43.3
8.5,29.8,29.4,29.4,29.8,35.3,32.8,35.4,28.9,40.5,41.8,41.4,41.2,23.6,41.6,42.2,43.5,27.0,43.9,43.5,43.6
9.0,29.9,29.5,29.5,29.9,35.4,33.6,35.5,30.5,40.8,41.8,41.6,41.4,23.6,41.9,42.4,43.7,30.8,44.0,43.6,43.8
9.5,29.9,29.6,29.5,30.0,35.5,34.2,35.6,31.7,41.0,41.8,41.7,41.5,23.6,42.2,42.5,43.7,33.9,44.0,43.7,44.0
10.0,30.0,29.7,29.6,30.0,35.5,34.6,35.7,32.7,41.1,41.9,41.8,41.7,23.6,42.4,42.6,43.8,36.2,44.0,43.7,44.1
10.5,30.0,29.7,29.6,30.1,35.6,35.0,35.7,33.3,41.2,41.9,41.8,41.8,23.6,42.6,42.6,43.8,37.9,44.0,43.8,44.2
11.0,30.0,29.7,29.6,30.1,35.7,35.2,35.8,33.8,41.3,41.9,41.9,41.8,24.0,42.9,42.7,43.8,39.3,,43.8,44.3
11.5,30.0,29.8,29.7,30.1,35.8,35.4,35.8,34.1,41.4,41.9,42.0,41.8,26.6,43.1,42.7,43.9,40.2,,43.8,44.3
12.0,30.0,29.8,29.7,30.1,35.8,35.5,35.9,34.3,41.4,42.0,42.0,41.9,30.3,43.3,42.7,43.9,40.9,,43.9,44.3
12.5,30.1,29.8,29.7,30.2,35.9,35.7,35.9,34.5,41.5,42.0,42.0,,33.4,43.4,42.7,44.0,41.4,,43.9,44.3
13.0,30.1,29.8,29.8,30.2,35.9,35.8,36.0,34.7,41.5,42.0,42.1,,35.8,43.5,42.7,44.0,41.8,,43.9,44.4
13.5,30.1,29.9,29.8,30.2,36.0,36.0,36.0,34.8,41.5,42.0,42.1,,37.7,43.5,42.8,44.1,42.0,,43.9,44.4
14.0,30.1,29.9,29.8,30.2,36.0,36.1,36.0,34.9,41.6,,42.2,,39.0,43.5,42.8,44.1,42.1,,,44.4
14.5,,29.9,29.8,,36.0,36.2,36.0,35.0,41.6,,42.2,,40.0,43.5,42.8,44.1,42.3,,,44.4
15.0,,29.9,,,36.0,36.3,,35.0,41.6,,42.2,,40.7,,42.8,44.1,42.4,,,
15.5,,,,,36.0,36.4,,35.1,41.6,,42.2,,41.3,,,,42.4,,,

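To convert a whole DataFrame into a NumPy array, use
arr = df.to_numpy()
(On older pandas versions, use the df.values attribute instead. It is an attribute, not a method, so df.values() raises a TypeError.)
If I understood you correctly, you want separate arrays for every trial though. This can be done like this:
data = [df.iloc[:, [0, i]].dropna().to_numpy() for i in range(1, df.shape[1])]
which will make a list of NumPy arrays, each containing the first column (time) and one of the trial columns, with the rows where that trial has no data dropped.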
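A minimal sketch of the per-trial split, assuming the column layout from the question. The inline CSV is just the last four rows and first three columns of the posted data, and the dictionary name `trials` is only illustrative:

```python
import io

import pandas as pd

# Last four rows and first three columns of the question's CSV, inlined for the demo
csv_text = """time,1mnaoh trial 1,1mnaoh trial 2
14.0,30.1,29.9
14.5,,29.9
15.0,,29.9
15.5,,
"""
df = pd.read_csv(io.StringIO(csv_text))

# One (n_rows, 2) array per trial: column 0 is time, column 1 is temperature.
# dropna() removes the rows where that trial has no reading.
trials = {
    name: df[["time", name]].dropna().to_numpy()
    for name in df.columns[1:]
}

print(trials["1mnaoh trial 1"].shape)  # (1, 2)
print(trials["1mnaoh trial 2"].shape)  # (3, 2)
```

For later fitting, `x, y = trials[name].T` unpacks one trial into separate time and temperature arrays, which is the shape scipy.optimize.curve_fit expects for its xdata and ydata arguments.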

Related

Converting 0-1 values in dataset with the name of the column if the value of the cell is 1

I have a csv dataset with the values 0-1 for the features of the elements. I want to iterate each cell and replace the values 1 with the name of its column. There are more than 500 thousand rows and 200 columns and, because the table is exported from another annotation tool which I update often, I want to find a way in Python to do it automatically.
This is not the real table, but a small test sample I was using while trying to write the code. I tried a few things, but without success.
I would really appreciate it if you can share your knowledge with me. It will be a huge help. The final result I want to have is of the type: (abonojnë, token_pos_verb). If you know any method that I can do this in Excel without the help of Python, it would be even better.
Thank you,
Brikena
Text,Comment,Role,ParentID,doc_completeness,lemma,MultiWord_Expr,token,pos,punctuation,verb,noun,adjective
abonojnë,,,,,,,1,1,0,1,0,0
çokasin,,,,,,,1,1,0,1,0,1
gërgasin,,,,,,,1,1,0,1,0,0
godasin,,,,,,,1,1,0,1,0,0
përkasin,,,,,,,1,1,1,1,0,0
përdjegin,,,,,,,1,1,0,1,0,0
lakadredhin,,,,,,,1,1,0,1,1,0
përdredhin,,,,,,,1,1,0,1,0,0
spërdredhin,,,,,,,1,1,0,1,0,0
përmbledhin,,,,,,,1,1,0,1,0,0
shpërdredhin,,,,,,,1,1,0,1,0,0
arsejnë,,,,,,,1,1,0,1,1,0
çapëlejnë,,,,,,,1,1,0,1,0,0
Using pandas, this is quite easy:
# pip install pandas
import pandas as pd
# read data (here example with csv, but use "read_excel" for excel)
df = pd.read_csv('input.csv').set_index('Text')
# reshape and export
(df.mul(df.columns).where(df.eq(1))
.stack().rename('xxx')
.groupby(level=0).apply('_'.join)
).to_csv('output.csv') # here use "to_excel" for excel format
output file:
Text,xxx
abonojnë,token_pos_verb
arsejnë,token_pos_verb_noun
godasin,token_pos_verb
gërgasin,token_pos_verb
lakadredhin,token_pos_verb_noun
përdjegin,token_pos_verb
përdredhin,token_pos_verb
përkasin,token_pos_punctuation_verb
përmbledhin,token_pos_verb
shpërdredhin,token_pos_verb
spërdredhin,token_pos_verb
çapëlejnë,token_pos_verb
çokasin,token_pos_verb_adjective
An update for those who may find it helpful in the future. Thank you to @mozway for helping me. A friend of mine suggested working with an Excel formula, because the solution with pandas and groupby eliminates duplicates. I need all the duplicates: it's an annotated corpus, so repeated words are normal and should appear in every context, not only at their first occurrence.
The other alternative is this:
Use a second sheet on the excel file, writing the formula =IF(Sheet1!B2=1,Sheet2!B$1,"") in the first cell with 0-1 values and drag it in all the other cells. This keeps all the occurrences of the words. It's quick and it works like magic.
I hope this can be helpful to others who want to convert a 0-1 dataset to feature names without having to code.
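For anyone who would still rather stay in pandas, the duplicates can be kept by building the label row by row instead of grouping on the word. This is only a sketch under assumptions: the feature columns are the 0/1 columns from the sample, the repeated abonojnë row is added here purely to show that duplicates survive, and the column name `features` is made up:

```python
import io

import numpy as np
import pandas as pd

# Two rows from the sample plus a deliberately repeated word
csv_text = """Text,token,pos,punctuation,verb,noun,adjective
abonojnë,1,1,0,1,0,0
çokasin,1,1,0,1,0,1
abonojnë,1,1,0,1,0,0
"""
df = pd.read_csv(io.StringIO(csv_text))

feature_cols = [c for c in df.columns if c != "Text"]
names = np.array(feature_cols)

# Boolean matrix: True where the cell equals 1
mask = df[feature_cols].eq(1).to_numpy()

# For each row, join the names of the columns flagged with 1.
# Nothing is grouped, so every row (duplicates included) keeps its own label.
df["features"] = ["_".join(names[row]) for row in mask]

print(df[["Text", "features"]])
```

The result can be written out with df.to_csv or df.to_excel as in the answer above.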

What is the most pythonic way to relate 2 pandas dataframes based on a key value?

I use Python (pandas) a lot at work, and the data keeps getting bigger: last month I was working with a few hundred thousand rows, a few weeks later with a few million, and now with 42 million. Most of my work is to take a dataframe and, for each row, look up its "equivalent" in another dataframe and process the data. Sometimes that is just a merge, but more often I need to apply a function to the equivalent data. With a few hundred thousand rows it was fine to use apply and a simple filter, but now it is EXTREMELY SLOW. I recently switched to vaex, which is much faster than pandas at everything except apply, and after some searching I found that apply is a last resort that should be used only if there is no other option. So, is there another option? I really don't know.
Some code to explain how I was doing this entire time:
def get_secondary(row: pd.Series):
    cnae = row["cnae_fiscal"]
    cnpj = row["cnpj"]
    # cnaes is another dataframe
    secondary = cnaes[cnaes.cnpj == cnpj]
    return [cnae] + list(secondary["cnae"].values)

empresas["cnae_secundarios"] = empresas.apply(get_secondary, axis=1)
This isn't the only use case, as I said.
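For this particular lookup, one apply-free option is to group the second dataframe by key once and then align the result to the first dataframe. A minimal sketch, assuming the column names from the snippet above; the two small frames are made-up stand-ins for the real data, and whether this is fast enough at 42 million rows is not something the sketch demonstrates:

```python
import pandas as pd

# Tiny made-up frames using the column names from the question
empresas = pd.DataFrame({"cnpj": ["A", "B", "C"], "cnae_fiscal": [10, 20, 30]})
cnaes = pd.DataFrame({"cnpj": ["A", "A", "B"], "cnae": [11, 12, 21]})

# Group the secondary codes once instead of filtering cnaes for every row
secondary = cnaes.groupby("cnpj")["cnae"].agg(list)

# Align the grouped lists to empresas by key; companies with no match get NaN
aligned = empresas["cnpj"].map(secondary)

# Prepend the main code, treating a missing match as an empty list
empresas["cnae_secundarios"] = [
    [main] + (extra if isinstance(extra, list) else [])
    for main, extra in zip(empresas["cnae_fiscal"], aligned)
]

print(empresas)
```

The per-row filter `cnaes[cnaes.cnpj == cnpj]` scans the whole second dataframe for every row of the first, while the groupby scans it once. Other row-wise functions need their own vectorized formulation, usually a merge followed by column arithmetic.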

Is there a pandas function to merge 2 dfs so that repeating items in the second df are added as columns to the first df?

I find it hard to formulate this problem in abstract terms, so I will mostly try to explain it with examples.
I have 2 pandas dataframes (I get them from a sqlite DB).
First DF:
Second DF:
So the thing is: There are several images per "capture". I would like to add the images to the capture df as columns, so that each capture has 9 image columns, each with a path. There are always 9 images per capture.
I solved it in pandas with what I know in the following way:
cam_idxs = sorted(list(range(9)) * 2)
for cam_idx in cam_idxs:
    sub_df = images.loc[(images["camera_id"] == cam_idx)]
    captures = captures.merge(sub_df[["image", "capture_id"]], left_on="id",
                              right_on="capture_id")
I imagine though that there must be a better way. At least I imagine people probably stumble into this problem more often when getting data from a sql database.
Since I am getting the data into pandas from a SQL database, I am also open to SQL commands that get me this result. I'd also be grateful if someone could tell me what this kind of operation is called; I did not find a good way to google for it, which is why I am asking here. Apologies if this has been asked before, I did not find anything with my search terms.
So the question at the end is: Is there a better way to do this, especially a more efficient way to do this?
What you are looking for is a pivot table.
You just need to create a column that numbers the images within each capture_id (1 to 9); that column becomes the columns of the pivot table.
For example, this could be:
images['column_pivot'] = list(range(1, 10)) * (len(images) // 9)
In your case 'column_pivot' would be [1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9...7,8,9] (i.e. cycling from 1 to 9). This assumes the rows are ordered so that the 9 images of each capture are consecutive.
Then you pivot. Because the values are path strings rather than numbers, pass aggfunc='first' (the default aggregation is the mean, which does not work on strings):
pd.pivot_table(images, columns='column_pivot', index='capture_id', values='image', aggfunc='first')
This will give the expected result.
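A small sketch of the same idea that does not rely on row order, assuming the column names from the question. The frame below is a made-up stand-in with 3 images per capture instead of 9, and `wide` is just an illustrative name:

```python
import pandas as pd

# Made-up stand-in for the images table: 2 captures, 3 images each for brevity
images = pd.DataFrame({
    "capture_id": [1, 1, 1, 2, 2, 2],
    "camera_id": [0, 1, 2, 0, 1, 2],
    "image": ["a0.png", "a1.png", "a2.png", "b0.png", "b1.png", "b2.png"],
})

# Number the images within each capture (1, 2, 3, ...)
images["column_pivot"] = images.groupby("capture_id").cumcount() + 1

# pivot() only reshapes, so string paths need no aggregation function
wide = images.pivot(index="capture_id", columns="column_pivot", values="image")

print(wide)
```

The wide table is indexed by capture_id, so it can be attached to the captures dataframe with something like captures.merge(wide, left_on="id", right_index=True). Using camera_id directly as the pivot columns would also work if each camera contributes exactly one image per capture.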

Does the function np.random.permutation work with or without replacement?

I am running an experiment several times (a Monte Carlo simulation). In each run I select a random set of columns and do the calculations with the selected columns. Initially I have to run the experiment with 5 columns and repeat it 100 times. My dataframe has 4000 columns.
I summarized the code, and I am only showing the part of the code which select randomly the columns. Please take into account that I repeat this procedure 100 times. The part of the code is as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.choice([0.0, 0.05], size=(1000,1000)))
rand_cols = np.random.permutation(df.columns)[0:5]
df2 = df[rand_cols].copy()
My question is the following:
Does the function np.random.permutation work with replacement or without replacement?
The reason I ask is that I run the code several times, at the moment 100 times. I will probably run it more times, and I need to know whether the function works with replacement. If it does not, is there another function that does the same thing but works with replacement?
Thanks
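From the documentation examples it is clear that numpy.random.permutation samples without replacement: it returns a shuffled copy in which every element appears exactly once, so the first five entries are always distinct. Each call is independent of the previous ones, so the same column can still be picked again in a later repetition. To sample with replacement within a single draw, use np.random.choice(df.columns, size=5, replace=True).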
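The difference can be checked directly. This sketch uses NumPy's newer Generator interface (np.random.default_rng) rather than the legacy np.random.permutation call from the question; the seed and the stand-in column labels are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
columns = np.arange(4000)  # stand-in for df.columns

# Without replacement: a permutation contains every column exactly once,
# so its first five entries are always distinct.
without = rng.permutation(columns)[:5]

# With replacement: the same column may be drawn more than once.
with_repl = rng.choice(columns, size=5, replace=True)

# Same effect as the permutation slice, stated more directly
also_without = rng.choice(columns, size=5, replace=False)

print(without, with_repl, also_without)
```

With 4000 columns and only 5 draws, a repeat under replace=True is possible but rare, so the two modes will usually look alike on any single run.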

Finding time difference between columns

I am currently working with a dataset which has two DateTime columns: ACTUAL_SHIPMENT_DTM and SHIPMENT_CONFIRMED_DTM.
I am trying to find the difference in time between the two columns. I have tried the following code but the output is giving me the time difference of one column based on the rows. Basically I want a new column to be populated with the time difference of (ACTUAL_SHIPMENT_DTM - SHIPMENT_CONFIRMED_DTM).
Golden['Cycle_TIme'] = Golden.groupby('ACTUAL_SHIPMENT_DTM')['SHIPMENT_CONFIRMED_DTM'].diff().dt.total_seconds()
Can anyone see errors in my code or guide me to proper documentation?
Lol, I underestimated myself and asked the question way too soon. If anyone wants to know how to find the time difference between two columns, here is my example code (Golden is my DataFrame):
Golden['Cycle_TIme'] = Golden["SHIPMENT_CONFIRMED_DTM"] - Golden["ACTUAL_SHIPMENT_DTM"]
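A self-contained version of the same subtraction, as a sketch. The timestamps are made up, and the extra `Cycle_seconds` column is only an illustration of turning the resulting timedelta into a number:

```python
import pandas as pd

# Made-up timestamps; the column names are the ones from the question
Golden = pd.DataFrame({
    "ACTUAL_SHIPMENT_DTM": ["2021-03-01 08:00:00", "2021-03-02 09:30:00"],
    "SHIPMENT_CONFIRMED_DTM": ["2021-03-01 10:00:00", "2021-03-02 09:45:00"],
})

# Subtraction only works on datetime columns, so parse the strings first
for col in ["ACTUAL_SHIPMENT_DTM", "SHIPMENT_CONFIRMED_DTM"]:
    Golden[col] = pd.to_datetime(Golden[col])

# Row-wise difference between the two columns (a timedelta column)
Golden["Cycle_TIme"] = Golden["SHIPMENT_CONFIRMED_DTM"] - Golden["ACTUAL_SHIPMENT_DTM"]

# Optional: the same difference expressed in seconds
Golden["Cycle_seconds"] = Golden["Cycle_TIme"].dt.total_seconds()

print(Golden)
```

Swapping the two operands flips the sign of the result. No groupby is needed, because the difference is taken within each row rather than between rows.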
