My data is organized in a data frame with the following structure:
| ID | Post | Platform |
| -------- | ------------------- | ----------- |
| 1 | Something #hashtag1 | Twitter |
| 2 | Something #hashtag2 | Insta |
| 3 | Something #hashtag1 | Twitter |
I have been able to extract and count the hashtags using the following (based on this post):
df.Post.str.extractall(r'(\#\w+)')[0].value_counts().rename_axis('hashtags').reset_index(name='count')
I am now trying to count hashtag occurrences per platform. I am trying the following:
df.groupby(['Post', 'Platform'])['Post'].str.extractall(r'(\#\w+)')[0].value_counts().rename_axis('hashtags').reset_index(name='count')
But, I am getting the following error:
AttributeError: 'SeriesGroupBy' object has no attribute 'str'
We can solve this easily in 2 steps. Assumption: each post has just a single hashtag.
Step 1: Create a new column with Hashtag
df['hashtag'] = df.Post.str.extractall(r'(\#\w+)')[0].reset_index()[0]
Step 2: Group by and get the counts
df.groupby('Platform').hashtag.count()
Generic solution: works for any number of hashtags.
We can solve this easily using 2 steps.
# extract all hashtags
df1 = df.Post.str.extractall(r'(\#\w+)')[0].reset_index()
# set the index to the index of the original row the hashtag came from
df1.set_index('level_0', inplace=True)
df1.rename(columns={0: 'hashtag'}, inplace=True)
df2 = pd.merge(df, df1, right_index=True, left_index=True)
df2.groupby('Platform').hashtag.count()
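Putting the generic solution together, here is a runnable sketch on the sample data from the question (column and platform names taken from the table above); grouping by both platform and hashtag gives the per-platform counts:

```python
import pandas as pd

# sample frame matching the table in the question
df = pd.DataFrame({
    'ID': [1, 2, 3],
    'Post': ['Something #hashtag1', 'Something #hashtag2', 'Something #hashtag1'],
    'Platform': ['Twitter', 'Insta', 'Twitter'],
})

# extract every hashtag; level_0 in the match index points back to the source row
df1 = df.Post.str.extractall(r'(\#\w+)')[0].reset_index()
df1 = df1.set_index('level_0').rename(columns={0: 'hashtag'})

# join the hashtags back onto the original rows, then count per platform and hashtag
df2 = df.merge(df1, left_index=True, right_index=True)
counts = df2.groupby(['Platform', 'hashtag']).size()
print(counts)
```

Because the join is on the original row index, this also handles posts containing several hashtags: each match becomes its own row before counting.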
result = {'data1': [1,2], 'data2': [4,5]}
I know how to write this if a key has a single value, but in this case each key has a list of values, so how can I iterate? I want to create a table in BigQuery as follows:
| data1 | data2 |
| -------- | -------------- |
| 1 | 4 |
| 2 | 5 |
I had the same problem and couldn't find a solution, so here is what I did (all our DBs are managed as a service in our code):
I fetch the table id, then call get_table(table_id).
Then I pass table._properties['schema']['fields'], together with the list data, to a function that converts it into JSON:
def __Convert_Data_To_Json(self, data, table_fields):
    if len(table_fields) != len(data[0]):
        raise CannotCreateConnectionError(CANNOT_CREATE_CONNECTION_MESSAGE % 'Data length does not match table field list')
    for row in data:
        dataset = {}
        i = 0
        for col in row:
            # the field name is the first value of each schema-field dict
            if col is None:
                dataset[list(table_fields[i].values())[0]] = None
            else:
                dataset[list(table_fields[i].values())[0]] = str(col)
            i += 1
        self.__rows_to_insert.append(dataset)
and then I use insert_rows_json.
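For a columns-to-lists dict like the one in the question, the conversion to the per-row dicts that insert_rows_json expects can be sketched like this (columns_to_rows is a hypothetical helper name, not part of any library):

```python
# a minimal sketch, assuming a {'col': [values...]} dict as in the question
def columns_to_rows(data):
    """Convert {'col': [v1, v2, ...]} into the list of per-row dicts
    that BigQuery's insert_rows_json expects."""
    keys = list(data)
    return [dict(zip(keys, values)) for values in zip(*data.values())]

rows = columns_to_rows({'data1': [1, 2], 'data2': [4, 5]})
print(rows)  # [{'data1': 1, 'data2': 4}, {'data1': 2, 'data2': 5}]

# then, with a google-cloud-bigquery client and an existing table:
# errors = client.insert_rows_json(table_id, rows)
```

The actual insert call is commented out because it needs credentials and an existing table; only the reshaping is shown runnable.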
This question already has answers here:
Printing Lists as Tabular Data
(20 answers)
Closed 3 years ago.
I want to make a table in Python
+----------------------------------+--------------------------+
| name | rank |
+----------------------------------+--------------------------+
| {} | [] |
+----------------------------------+--------------------------+
| {} | [] |
+----------------------------------+--------------------------+
The problem is that I first want to load a text file containing domain names, then make a GET request to each domain one by one, and then print the website name and status code in a table that is perfectly aligned, as shown above. I have written some code, but I could not get the output into that aligned table format.
Here is my code
f = open('sub.txt', 'r')
for i in f:
    try:
        x = requests.get('http://' + i.strip())
        code = str(x.status_code)
        # Now here I want to display `code` and `i` in table format
    except:
        pass
In the above code I want to display the code and i variables in the table format shown above.
Thank you
You can achieve this using the center() method of str: it returns a new string padded to a given width with a fill character (spaces by default). For example:
f = ['AAA','BBBBB','CCCCCC']
codes = [401,402,105]
col_width = 40
print("+"+"-"*col_width+"+"+"-"*col_width+"+")
print("|"+"Name".center(col_width)+"|"+"Rank".center(col_width)+"|")
print("+"+"-"*col_width+"+"+"-"*col_width+"+")
for i in range(len(f)):
    _f = f[i]
    code = str(codes[i])
    # the name goes under "Name", the status code under "Rank"
    print("|"+_f.center(col_width)+"|"+code.center(col_width)+"|")
    print("+"+"-"*col_width+"+"+"-"*col_width+"+")
Output
+----------------------------------------+----------------------------------------+
|                  Name                  |                  Rank                  |
+----------------------------------------+----------------------------------------+
|                  AAA                   |                  401                   |
+----------------------------------------+----------------------------------------+
|                 BBBBB                  |                  402                   |
+----------------------------------------+----------------------------------------+
|                 CCCCCC                 |                  105                   |
+----------------------------------------+----------------------------------------+
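To tie this back to the request loop in the question, the same centering approach can be wrapped in two small helpers (make_row and make_border are hypothetical names, and the results list stands in for the live requests responses):

```python
def make_row(left, right, col_width=40):
    # center each cell to the same width used by the +---+ borders
    return "|" + str(left).center(col_width) + "|" + str(right).center(col_width) + "|"

def make_border(col_width=40):
    return "+" + "-" * col_width + "+" + "-" * col_width + "+"

# hypothetical results standing in for (domain, status_code) pairs from requests
results = [("example.com", 200), ("api.example.com", 404)]

print(make_border())
print(make_row("Name", "Rank"))
print(make_border())
for domain, status in results:
    print(make_row(domain, status))
    print(make_border())
```

Inside the original loop you would call make_row(i.strip(), code) instead of iterating over a prebuilt list.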
I am new to pandas. The following is a subset of a dataframe named news.
Id is the id of the news item and the text column contains the news:
| Id | text                          |
|----|-------------------------------|
| 1  | the news is really bad.       |
| 2  | I do not have any courses.    |
| 3  | Asthma is very prevalent.     |
| 4  | depression causes disability. |
I am going to calculate the sentiment for each news item in the "text" column.
I need to create a column to hold the result of the sentiment analysis.
This is my code:
from textblob import TextBlob
review = TextBlob(news.loc[0,'text'])
print (review.sentiment.polarity)
This code works for just one of the news items in the text column.
I also wrote this function:
def detect_sentiment(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity
news['sentiment'] = news.text.apply(detect_sentiment)
But it has the following error:
The `text` argument passed to `__init__(text)` must be a string, not <class 'float'>
Any solution?
I cannot reproduce your bug: your exact code works perfectly fine for me using pandas==0.24.2 and Python 3.4.3:
import pandas as pd
from textblob import TextBlob

news = pd.DataFrame(["the news is really bad.",
                     "I do not have any courses.",
                     "Asthma is very prevalent.",
                     "depression causes disability."], columns=["text"])

def detect_sentiment(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity

news['sentiment'] = news.text.apply(detect_sentiment)
display(news)  # in a notebook; use print(news) otherwise
Result:
+----+-------------------------------+-------------+
| | text | sentiment |
|----+-------------------------------+-------------|
| 0 | the news is really bad. | -0.7 |
| 1 | I do not have any courses. | 0 |
| 2 | Asthma is very prevalent. | 0.2 |
| 3 | depression causes disability. | 0 |
+----+-------------------------------+-------------+
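If the error does appear on your side, the <class 'float'> in the message almost always means the text column contains missing values, which pandas stores as float NaN. A hedged sketch of two common guards, using a small stand-in frame with one missing value:

```python
import pandas as pd

# stand-in frame with one missing value; NaN in a string column is a float,
# which is exactly what the TextBlob error complains about
news = pd.DataFrame({'text': ['the news is really bad.',
                              float('nan'),
                              'Asthma is very prevalent.']})

# option 1: analyze only the rows that actually contain text
clean = news.dropna(subset=['text'])

# option 2: replace missing values with an empty string before apply()
news['text'] = news['text'].fillna('')
print(news)
```

After either step, news.text.apply(detect_sentiment) receives only strings and the error goes away (with option 2, empty strings get a neutral polarity of 0.0).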
I am trying to put together a usable set of data about glaciers. Our original data comes from an ArcGIS dataset, and the latitude/longitude values were stored in a separate file, now detached from the CSV with all of our data. I am attempting to merge the latitude/longitude file with our data set. Here's a preview of what the files look like.
This is my main dataset file, glims (columns dropped for clarity)
| ANLYS_ID | GLAC_ID | AREA |
|----------|----------------|-------|
| 101215 | G286929E46788S | 2.401 |
| 101146 | G286929E46788S | 1.318 |
| 101162 | G286929E46788S | 0.061 |
This is the latitude-longitude file, coordinates
| lat | long | glacier_id |
|-------|---------|----------------|
| 1.187 | -70.166 | G001187E70166S |
| 2.050 | -70.629 | G002050E70629S |
| 3.299 | -54.407 | G002939E70509S |
The problem is, the coordinates data frame has one row for each glacier id with latitude longitude, whereas my glims data frame has multiple rows for each glacier id with varying data for each entry.
I need every single entry in my main data file to have a latitude-longitude value added to it, based on the matching glacier_id between the two data frames.
Here's what I've tried so far.
glims = pd.read_csv('glims_clean.csv')
coordinates = pd.read_csv('LatLong_GLIMS.csv')
df['que'] = np.where((coordinates['glacier_id'] == glims['GLAC_ID']))
error returns: 'int' object is not subscriptable
and:
glims.merge(coordinates, how='right', on=('glacier_id', 'GLAC_ID'))
error returns: 'int' object has no attribute 'merge'
I have no idea how to tackle this big of a merge. I am also afraid of making mistakes because it is nearly impossible to catch them, since the data carries no other identifying factors.
Any guidance would be awesome, thank you.
This should work
glims = glims.merge(coordinates, how='left', left_on='GLAC_ID', right_on='glacier_id')
This is a classic merging problem. One way to solve it is with straight loc and index matching:
glims = glims.set_index('GLAC_ID')
glims.loc[:, 'lat'] = coordinates.set_index('glacier_id').lat
glims.loc[:, 'long'] = coordinates.set_index('glacier_id').long
glims = glims.reset_index()
You can also use pd.merge
pd.merge(glims,
         coordinates.rename(columns={'glacier_id': 'GLAC_ID'}),
         on='GLAC_ID')
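A runnable sketch of the left merge on data modeled after the question (the glacier ids are adjusted here so the keys actually overlap, since the sample rows shown above do not share ids):

```python
import pandas as pd

# sample frames modeled on the question, with overlapping glacier ids
glims = pd.DataFrame({
    'ANLYS_ID': [101215, 101146, 101162],
    'GLAC_ID': ['G001187E70166S', 'G001187E70166S', 'G002050E70629S'],
    'AREA': [2.401, 1.318, 0.061],
})
coordinates = pd.DataFrame({
    'lat': [1.187, 2.050],
    'long': [-70.166, -70.629],
    'glacier_id': ['G001187E70166S', 'G002050E70629S'],
})

# how='left' keeps every glims row; a coordinate pair is repeated for each
# analysis row that shares its glacier id
merged = glims.merge(coordinates, how='left', left_on='GLAC_ID', right_on='glacier_id')
print(merged[['ANLYS_ID', 'GLAC_ID', 'lat', 'long']])
```

Rows whose GLAC_ID has no match in coordinates would simply get NaN for lat/long, which makes any gaps in the coordinate file easy to spot afterwards.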
I am playing a little with PrettyTable in Python and I noticed completely different behavior in Python 2 and Python 3. Can somebody explain the difference in output to me? Nothing in the docs gave me a satisfying answer. Let's start with a little code and create my_table:
from prettytable import PrettyTable
my_table = PrettyTable()
my_table.field_name = ['A','B']
It creates a two-column table with columns A and B. Let's add one row to it, but assume that a cell value can have multiple lines separated by the Python newline '\n', for example some properties of the parameter from column A.
row = ['parameter1', 'component: my_component\nname:somename\nmode: magic\ndate: None']
my_table.add_row(row)
Generally the information in a row can be anything; it's just a string retrieved from another function. As you can see, it has '\n' inside. The thing I don't completely understand is the output of the print function.
In Python 2 I have:
print(my_table.get_string().encode('utf-8'))
which gives me output like this:
+------------+-------------------------+
|  Field 1   |         Field 2         |
+------------+-------------------------+
| parameter1 | component: my_component |
|            |      name:somename      |
|            |       mode: magic       |
|            |       date: None        |
+------------+-------------------------+
But in Python 3 I have:
b'+------------+-------------------------+\n|  Field 1   |         Field 2         |\n+------------+-------------------------+\n| parameter1 | component: my_component |\n|            |      name:somename      |\n|            |       mode: magic       |\n|            |       date: None        |\n+------------+-------------------------+'
If I completely remove the encode part, the output looks OK on both versions of Python.
So when I have
print(my_table.get_string())
it works on Python 3 and Python 2. Should I remove the encode part from the code? Is it safe to assume it is not necessary? Where exactly is the problem?
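The different output can be reproduced without PrettyTable at all; a minimal sketch of what encode() changes under Python 3:

```python
table = "+---+\n| A |\n+---+"

# In Python 3, str.encode() returns bytes, and printing bytes shows their
# repr: one line starting with b'...' with the newlines as literal \n escapes.
encoded = table.encode('utf-8')
print(encoded)

# Printing the plain string renders the real newlines instead.
print(table)
```

Under Python 3, get_string() already returns str, so dropping the .encode('utf-8') call (as in the last snippet above) is the right fix; encoding is only needed when writing raw bytes to a file or socket.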