Is it possible to have a dataframe with a column (for example, one called "Data") in which each element is a numpy array?
| Data | Time |
| [1, 2, 3, ... 10] | June 12, 2020 |
| [11, 12, ..., 20] | June 13, 2020 |
If so, how do you create a dataframe in this format?
Not sure you want to do it this way, but it works.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Data': [np.array([1, 2, 3, 10]), np.array([11,12,13,20])], 'Time' : ['June 12, 2020', 'June 13, 2020']})
print (df)
Output:
Data Time
0 [1, 2, 3, 10] June 12, 2020
1 [11, 12, 13, 20] June 13, 2020
You can also do it with lists:
df = pd.DataFrame({'Data': [[1, 2, 3, 10], [11,12,13,20]], 'Time' : ['June 12, 2020', 'June 13, 2020']})
Yes, you can; see this question. It's useful when your data is grouped by date, index, etc., because it compresses several rows into one. In terms of pandas operations it may not be that efficient, though, so you may prefer to use the groupby() method and then apply operations.
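As a rough sketch of the groupby() alternative mentioned above, assuming the same two-row frame as in the earlier answer: the array column is stored with dtype object, so one option is to explode it into one row per element and aggregate from there. The names long_df and sums are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Data': [np.array([1, 2, 3, 10]), np.array([11, 12, 13, 20])],
    'Time': ['June 12, 2020', 'June 13, 2020'],
})

# A column of arrays is stored as generic Python objects.
print(df['Data'].dtype)

# One row per array element, with a proper integer dtype again.
long_df = df.explode('Data').astype({'Data': 'int64'})

# Ordinary vectorised pandas operations now apply.
sums = long_df.groupby('Time')['Data'].sum()
print(sums)
```

Whether the long format is preferable depends on how the arrays are used; if they are only stored and retrieved whole, the object column may be good enough.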
I have two CSV files, Master-Data and Component-Data. Master-Data has two rows and two columns, whereas Component-Data has five rows and two columns.
I'm trying to find the cosine similarity between each pair of sentences after tokenization, stemming and lemmatization, and then append the similarity index as new columns. I'm unable to append the corresponding values to the column in the dataframe, which then needs to be converted to CSV.
My Approach:
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer,WordNetLemmatizer
from collections import Counter
import pandas as pd
portStemmer=PorterStemmer()
wordNetLemmatizer = WordNetLemmatizer()
fields = ['Sentences']
cosineSimilarityList = []
def fetchLemmantizedWords():
    eliminatePunctuation = re.sub('[^a-zA-Z]', ' ',value)
    convertLowerCase = eliminatePunctuation.lower()
    tokenizeData = convertLowerCase.split()
    eliminateStopWords = [word for word in tokenizeData if not word in set(stopwords.words('english'))]
    stemWords= list(set([portStemmer.stem(value) for value in eliminateStopWords]))
    wordLemmatization = [wordNetLemmatizer.lemmatize(x) for x in stemWords]
    return wordLemmatization
def fetchCosine(eachMasterData,eachComponentData):
    masterDataValues = Counter(eachMasterData)
    componentDataValues = Counter(eachComponentData)
    bagOfWords = list(masterDataValues.keys() | componentDataValues.keys())
    masterDataVector = [masterDataValues.get(bagOfWords, 0) for bagOfWords in bagOfWords]
    componentDataVector = [componentDataValues.get(bagOfWords, 0) for bagOfWords in bagOfWords]
    masterDataLength = sum(contractElement*contractElement for contractElement in masterDataVector) ** 0.5
    componentDataLength = sum(questionElement*questionElement for questionElement in componentDataVector) ** 0.5
    dotProduct = sum(contractElement*questionElement for contractElement,questionElement in zip(masterDataVector, componentDataVector))
    cosine = int((dotProduct / (masterDataLength * componentDataLength))*100)
    return cosine
masterData = pd.read_csv('C:\\Similarity\\MasterData.csv', skipinitialspace=True)
componentData = pd.read_csv('C:\\Similarity\\ComponentData.csv', skipinitialspace=True)
for value in masterData['Sentences']:
    eachMasterData = fetchLemmantizedWords()
    for value in componentData['Sentences']:
        eachComponentData = fetchLemmantizedWords()
        cosineSimilarity = fetchCosine(eachMasterData,eachComponentData)
        cosineSimilarityList.append(cosineSimilarity)
for value in cosineSimilarityList:
    componentData = componentData.append(pd.DataFrame(cosineSimilarityList, columns=['Cosine Similarity']), ignore_index=True)
    #componentData['Cosine Similarity'] = value
expected output after converting the df to CSV,
I'm having trouble appending the values to the dataframe. Please suggest an approach for this. Thanks.
Here's what I came up with:
Sample set up
csv_master_data = \
"""
SI.No;Sentences
1;Emma is writing a letter.
2;We wake up early in the morning.
"""
csv_component_data = \
"""
SI.No;Sentences
1;Emma is writing a letter.
2;We wake up early in the morning.
3;Did Emma Write a letter?
4;We sleep early at night.
5;Emma wrote a letter.
"""
import pandas as pd
from io import StringIO
df_md = pd.read_csv(StringIO(csv_master_data), delimiter=';')
df_cd = pd.read_csv(StringIO(csv_component_data), delimiter=';')
We end up with 2 dataframes (showing df_cd):
|   | SI.No | Sentences |
| 0 | 1 | Emma is writing a letter. |
| 1 | 2 | We wake up early in the morning. |
| 2 | 3 | Did Emma Write a letter? |
| 3 | 4 | We sleep early at night. |
| 4 | 5 | Emma wrote a letter. |
I replaced the 2 functions you used by the following dummy functions:
import random
def fetchLemmantizedWords(words):
    return [random.randint(1,30) for x in words]
def fetchCosine(lem_md, lem_cd):
    return 100 if len(lem_md) == len(lem_cd) else random.randint(0,100)
Processing data
First, we apply the fetchLemmantizedWords function to each dataframe. The lowercasing, regex replace and split of the sentences are done by pandas rather than in the function itself.
By making the sentence lowercase first, we can simplify the regex to only consider lowercase letters.
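for df in (df_md, df_cd):
    # Use the .str accessor so that replace() treats the pattern as a regex;
    # plain Python str.replace() would look for the literal text '[^a-z]'.
    df['lem'] = (df['Sentences']
                 .str.lower()
                 .str.replace(r'[^a-z]', ' ', regex=True)
                 .str.split()
                 .apply(fetchLemmantizedWords))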
Result for df_cd:
|   | SI.No | Sentences | lem |
| 0 | 1 | Emma is writing a letter. | [29, 5, 4, 9, 28] |
| 1 | 2 | We wake up early in the morning. | [16, 8, 21, 14, 13, 4, 6] |
| 2 | 3 | Did Emma Write a letter? | [30, 9, 23, 16, 5] |
| 3 | 4 | We sleep early at night. | [8, 25, 24, 7, 3] |
| 4 | 5 | Emma wrote a letter. | [30, 30, 15, 7] |
Next, we use a cross-join to make a dataframe with all possible combinations of md and cd data.
df_merged = pd.merge(df_md[['SI.No', 'lem']],
df_cd[['SI.No', 'lem']],
how='cross',
suffixes=('_md','_cd')
)
df_merged contents:
|   | SI.No_md | lem_md | SI.No_cd | lem_cd |
| 0 | 1 | [14, 22, 9, 21, 4] | 1 | [3, 4, 8, 17, 2] |
| 1 | 1 | [14, 22, 9, 21, 4] | 2 | [29, 3, 10, 2, 19, 18, 21] |
| 2 | 1 | [14, 22, 9, 21, 4] | 3 | [20, 22, 29, 4, 3] |
| 3 | 1 | [14, 22, 9, 21, 4] | 4 | [17, 7, 1, 27, 19] |
| 4 | 1 | [14, 22, 9, 21, 4] | 5 | [17, 5, 3, 29] |
| 5 | 2 | [12, 30, 10, 11, 7, 11, 8] | 1 | [3, 4, 8, 17, 2] |
| 6 | 2 | [12, 30, 10, 11, 7, 11, 8] | 2 | [29, 3, 10, 2, 19, 18, 21] |
| 7 | 2 | [12, 30, 10, 11, 7, 11, 8] | 3 | [20, 22, 29, 4, 3] |
| 8 | 2 | [12, 30, 10, 11, 7, 11, 8] | 4 | [17, 7, 1, 27, 19] |
| 9 | 2 | [12, 30, 10, 11, 7, 11, 8] | 5 | [17, 5, 3, 29] |
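Note that how='cross' in the merge above needs pandas 1.2 or later. On older versions, a common workaround is to merge on a constant helper column. A minimal sketch on toy frames (the frames left and right and the helper column _key are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({'SI.No': [1, 2]})
right = pd.DataFrame({'SI.No': [1, 2, 3]})

# Give both frames the same constant key, merge on it, then drop it again.
crossed = (left.assign(_key=1)
               .merge(right.assign(_key=1), on='_key', suffixes=('_md', '_cd'))
               .drop(columns='_key'))
print(crossed)
```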
Next, we calculate the cosine value:
df_merged['cosine'] = df_merged.apply(lambda x: fetchCosine(x.lem_md,
x.lem_cd),
axis=1)
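For real data, the dummy fetchCosine would be swapped for an actual similarity function. The following is only a sketch of what that might look like, following the Counter-based approach from the question; the name cosine_percent is hypothetical, and rounding to a whole percentage is an assumption (the question truncates with int()).

```python
from collections import Counter
import math

def cosine_percent(tokens_a, tokens_b):
    """Cosine similarity of two token lists, as a rounded percentage."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only tokens present in both lists contribute to the dot product.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_sq_a = sum(c * c for c in counts_a.values())
    norm_sq_b = sum(c * c for c in counts_b.values())
    if norm_sq_a == 0 or norm_sq_b == 0:
        return 0  # an empty token list has no direction to compare
    return round(100 * dot / math.sqrt(norm_sq_a * norm_sq_b))

print(cosine_percent(['emma', 'write', 'letter'], ['emma', 'wrote', 'letter']))
```

It takes the same two arguments as fetchCosine, so it could be passed to the apply() call in the same way.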
In the last step, we pivot the data and merge the original df_cd with the calculated results:
pd.merge(df_cd.drop(columns='lem').set_index('SI.No'),
         # Pivot only the 'cosine' column; the list-valued 'lem' columns cannot be averaged.
         df_merged.pivot_table(index='SI.No_cd',
                               columns='SI.No_md',
                               values='cosine'),
         how='inner',
         left_index=True,
         right_index=True)
Result (again, these are dummy calculations):
| SI.No | Sentences | 1 | 2 |
| 1 | Emma is writing a letter. | 100 | 64 |
| 2 | We wake up early in the morning. | 63 | 100 |
| 3 | Did Emma Write a letter? | 100 | 5 |
| 4 | We sleep early at night. | 100 | 17 |
| 5 | Emma wrote a letter. | 35 | 9 |
I have a dataset where I group the monthly data by id:
temp1 = listvar[2].groupby(["id", "month"])["value"].mean()
This results in this:
id month
SN10380 1 -9.670370
2 -8.303571
3 -4.932143
4 0.475862
5 5.732000
...
SN99950 8 6.326786
9 4.623529
10 1.290566
11 -0.867273
12 -2.485455
I then want each month and its corresponding value as its own column for the same id, like this:
id month_1 month_2 month_3 month_4 .... month_12
SN10380 -9.670370 -8.303571 .....
SN99950
I have tried different solutions using apply(), transform() and agg(), but haven't been able to produce the wanted output.
You could use unstack. Here's the sample code:
import pandas as pd
df = pd.DataFrame({
"id": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
"month": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
"value": [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
})
temp1 = df.groupby(["id", "month"])["value"].mean()
temp1.unstack()
I hope it helps!
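To get the month_1 ... month_12 column names asked for in the question, the unstacked result could be renamed with add_prefix. A small sketch on the same sample data (the variable name wide is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "month": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "value": [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
})

temp1 = df.groupby(["id", "month"])["value"].mean()

# Months become columns, get a 'month_' prefix, and 'id' turns back into a regular column.
wide = temp1.unstack().add_prefix("month_").reset_index()
print(wide)
```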
I have an array/list/pandas Series:
np.arange(15)
Out[11]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
What I want is:
[[0,1,2,3,4,5],
[1,2,3,4,5,6],
[2,3,4,5,6,7],
...
[10,11,12,13,14]]
That is, repeatedly slide over this column to build a 5-column matrix.
The reason is that I am doing feature engineering for a column of temperature data. I want to use the last 5 values as features and the next one as the target.
What's the most efficient way to do that? My data is large.
If the array is formatted like this:
arr = np.array([1,2,3,4,5,6,7,8,....])
You could try it like this (a plain 2-D array, one window of 5 values per row):
recurr_transpose = np.array([arr[i:i+5] for i in range(len(arr)-4)])
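Since the data is large, a list comprehension may become slow. One possible alternative, assuming NumPy 1.20 or later, is sliding_window_view, which returns a view on the original array instead of copying each window. The sketch below uses windows of 6 so that the first 5 columns can serve as features and the last one as the target; the names X and y are only illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

arr = np.arange(15)

# Each row is a window of 6 consecutive values; no data is copied here.
windows = sliding_window_view(arr, 6)

X = windows[:, :5]  # the 5 previous values
y = windows[:, 5]   # the value that follows them

print(X.shape, y.shape)
print(X[0], y[0])
```

The view is read-only by default, so copy it (for example with X.copy()) before modifying it in place.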
The dataframe looks like this:
0, 3710.968017578125, 2012-01-07T03:13:43.859Z
1, 3710.968017578125, 2012-01-07T03:13:48.890Z
2, 3712.472900390625, 2012-01-07T03:13:53.906Z
3, 3712.472900390625, 2012-01-07T03:13:58.921Z
4, 3713.110107421875, 2012-01-07T03:14:03.900Z
5, 3713.110107421875, 2012-01-07T03:14:03.937Z
6, 3713.89892578125, 2012-01-07T03:14:13.900Z
7, 3713.89892578125, 2012-01-07T03:14:13.968Z
8, 3713.89892578125, 2012-01-07T03:14:19.000Z
9, 3714.64990234375, 2012-01-07T03:14:24.000Z
10, 3714.64990234375, 2012-01-07T03:14:24.015Z
11, 3714.64990234375, 2012-01-07T03:14:29.000Z
12, 3714.64990234375, 2012-01-07T03:14:29.031Z
Some rows have timestamps that differ only in the milliseconds. I want to drop those and keep only the rows whose timestamps differ in the seconds. The value column can be identical both for rows within the same second and for rows in different seconds (for example rows 9 to 12), so I can't use a.loc[a.shift() != a].
The desired output would be:
0, 3710.968017578125, 2012-01-07T03:13:43.859Z
1, 3710.968017578125, 2012-01-07T03:13:48.890Z
2, 3712.472900390625, 2012-01-07T03:13:53.906Z
3, 3712.472900390625, 2012-01-07T03:13:58.921Z
4, 3713.110107421875, 2012-01-07T03:14:03.900Z
6, 3713.89892578125, 2012-01-07T03:14:13.900Z
8, 3713.89892578125, 2012-01-07T03:14:19.000Z
9, 3714.64990234375, 2012-01-07T03:14:24.000Z
11, 3714.64990234375, 2012-01-07T03:14:29.000Z
Try:
df.groupby(pd.to_datetime(df[2]).dt.floor('s')).head(1)
I hope it's self-explanatory.
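The same idea, truncating each timestamp to the whole second and keeping the first row per second, can be sketched on a few of the rows from the question. The column names value and ts are invented here, and duplicated() is used instead of groupby().head(1) only as an alternative way to express it.

```python
import pandas as pd

df = pd.DataFrame({
    'value': [3713.110107421875, 3713.110107421875,
              3713.89892578125, 3713.89892578125, 3713.89892578125],
    'ts': ['2012-01-07T03:14:03.900Z', '2012-01-07T03:14:03.937Z',
           '2012-01-07T03:14:13.900Z', '2012-01-07T03:14:13.968Z',
           '2012-01-07T03:14:19.000Z'],
})

# Truncate every timestamp to the start of its second.
second = pd.to_datetime(df['ts']).dt.floor('s')

# Keep only the first row seen for each distinct second.
out = df[~second.duplicated(keep='first')]
print(out)
```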
You can use below script. I didn't get your dataframe column names so I invented below columns ['x', 'date_time']
df = pd.DataFrame([
(3710.968017578125, pd.to_datetime('2012-01-07T03:13:43.859Z')),
(3710.968017578125, pd.to_datetime('2012-01-07T03:13:48.890Z')),
(3712.472900390625, pd.to_datetime('2012-01-07T03:13:53.906Z')),
(3712.472900390625, pd.to_datetime('2012-01-07T03:13:58.921Z')),
(3713.110107421875, pd.to_datetime('2012-01-07T03:14:03.900Z')),
(3713.110107421875, pd.to_datetime('2012-01-07T03:14:03.937Z')),
(3713.89892578125, pd.to_datetime('2012-01-07T03:14:13.900Z')),
(3713.89892578125, pd.to_datetime('2012-01-07T03:14:13.968Z')),
(3713.89892578125, pd.to_datetime('2012-01-07T03:14:19.000Z')),
(3714.64990234375, pd.to_datetime('2012-01-07T03:14:24.000Z')),
(3714.64990234375, pd.to_datetime('2012-01-07T03:14:24.015Z')),
(3714.64990234375, pd.to_datetime('2012-01-07T03:14:29.000Z')),
(3714.64990234375, pd.to_datetime('2012-01-07T03:14:29.031Z'))],
columns=['x', 'date_time'])
# Create a temporary column 'time_diff' holding the difference between the
# datetime of the current row and the previous row with the same 'x'.
df['time_diff'] = df.groupby('x')['date_time'].diff()
# Keep only the rows whose difference is either missing (first row of a group)
# or more than 1 second.
df = df[(df['time_diff'].isnull()) | (df['time_diff'].map(lambda x: x.seconds > 1))]
# Drop the temporary column.
df = df.drop(['time_diff'], axis=1)
df
Is it possible to have a 3-D record array in numpy? (Maybe this is not possible, or there is simply an easier way to do things too -- I am open to other options).
Assume I want an array that holds data for 3 variables (say temp, precip, humidity), and each variable's data is actually a 2-d array of 2 years (rows) and 6 months of data (columns), I could create that like this:
>>> import numpy as np
>>> d = np.array(np.arange(3*2*6).reshape(3,2,6))
>>> d
#
# comments added for explanation...
# jan feb mar apr may Jun
array([[[ 0, 1, 2, 3, 4, 5], # yr1 temp
[ 6, 7, 8, 9, 10, 11]], # yr2 temp
[[12, 13, 14, 15, 16, 17], # yr1 precip
[18, 19, 20, 21, 22, 23]], # yr2 precip
[[24, 25, 26, 27, 28, 29], # yr1 humidity
[30, 31, 32, 33, 34, 35]]]) # yr2 humidity
I'd like to be able to type:
>>> d['temp']
and get this (the first "page" of the data):
>>> array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11]])
or:
>>> d['Jan'] # assume months are Jan-June
and get this
>>> array([[0,6],
[12,18],
[24,30]])
I have been through this: http://www.scipy.org/RecordArrays a number of times, but don't see how set up what I am after.
Actually, you can do something similar to this with structured arrays, but it's generally more trouble than it's worth.
What you want is basically labeled axes.
Pandas (which is built on top of numpy) provides what you want, and is a better choice if you want this type of indexing. There's also Larry (for labeled array), but it's largely been superseded by Pandas.
Also, you should be looking at the numpy documentation for structured arrays for info on this, rather than an FAQ. The numpy documentation has considerably more information. http://docs.scipy.org/doc/numpy/user/basics.rec.html
If you do want to take a pure-numpy route, note that structured arrays can contain multidimensional arrays. (Note the shape argument when specifying a dtype.) This will rapidly get more complex than it's worth, though.
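For what it's worth, here is a minimal sketch of that pure-numpy route, assuming one structured record whose three fields each hold a (2, 6) sub-array. The names dt and rec are illustrative. It gives access by variable name, but not by month or year label.

```python
import numpy as np

d = np.arange(3 * 2 * 6).reshape(3, 2, 6)

# Each field is itself a 2x6 array (years x months).
dt = np.dtype([('temp', np.int64, (2, 6)),
               ('precip', np.int64, (2, 6)),
               ('humidity', np.int64, (2, 6))])

# A structured array with a single record, filled one field at a time.
rec = np.zeros(1, dtype=dt)
for name, page in zip(dt.names, d):
    rec[name][0] = page

print(rec['temp'][0])
```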
In pandas terminology, what you want is a Panel. You should probably get familiar with DataFrames first, though.
Here's how you'd do it with Pandas:
import numpy as np
import pandas
d = np.array(np.arange(3*2*6).reshape(3,2,6))
dat = pandas.Panel(d, items=['temp', 'precip', 'humidity'],
major_axis=['yr1', 'yr2'],
minor_axis=['jan', 'feb', 'mar', 'apr', 'may', 'jun'])
print(dat['temp'])
print(dat.major_xs('yr1'))
print(dat.minor_xs('may'))
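Note that Panel was deprecated and then removed in pandas 0.25, so the code above only runs on older versions. One possible substitute on current pandas, shown here purely as a sketch, is a DataFrame with a two-level row index (variable, year) and the months as columns. The level names variable and year are made up for this example.

```python
import numpy as np
import pandas as pd

d = np.arange(3 * 2 * 6).reshape(3, 2, 6)
variables = ['temp', 'precip', 'humidity']
years = ['yr1', 'yr2']
months = ['jan', 'feb', 'mar', 'apr', 'may', 'jun']

# Flatten the first two axes into rows labelled by (variable, year).
index = pd.MultiIndex.from_product([variables, years], names=['variable', 'year'])
dat = pd.DataFrame(d.reshape(len(variables) * len(years), len(months)),
                   index=index, columns=months)

print(dat.loc['temp'])               # the first "page": years x months
print(dat.xs('yr1', level='year'))   # all variables for one year
print(dat['jan'].unstack('year'))    # one month: variables x years
```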