result = {'data1': [1,2], 'data2': [4,5]}
I know how to write if I key has single value, but I this case it has list of values how I can iterate. Create table in BigQuery as follows:
| data1 | data2 |
| -------- | -------------- |
| 1 | 4 |
| 2 | 5 |
I had to same problem and I couldnt find a solution, so what I did(All our DB's are managed as service in our code) is the following:
I am fetching the table id, then using get_table(table_id) .
Now I am sending the table._properties['schema']['fields'] with the list data to a function that convert it into Json
def __Convert_Data_To_Json(self,data,table_fields):
if not(len(table_fields)==len(data[0])):
raise CannotCreateConnectionError(CANNOT_CREATE_CONNECTION_MESSAGE % str('Data length doesnt match table feild list'))
for row in data:
dataset={}
i=0
for col in row:
if col==None:
dataset[list(table_fields[i].values())[0] ]=None
else:
dataset[list(table_fields[i].values())[0] ]=str(col)
i+=1
self.__rows_to_insert.append(dataset)
and then using insert_rows_json
My data is organized in a data frame with the following structure
| ID | Post | Platform |
| -------- | ------------------- | ----------- |
| 1 | Something #hashtag1 | Twitter |
| 2 | Something #hashtag2 | Insta |
| 3 | Something #hashtag1 | Twitter |
I have been able to extract and count the hashtag using the following (using this post):
df.Post.str.extractall(r'(\#\w+)')[0].value_counts().rename_axis('hashtags').reset_index(name='count')
I am now trying to count hashtag operation occurrence from each platform. I am trying the following:
df.groupby(['Post', 'Platform'])['Post'].str.extractall(r'(\#\w+)')[0].value_counts().rename_axis('hashtags').reset_index(name='count')
But, I am getting the following error:
AttributeError: 'SeriesGroupBy' object has no attribute 'str'
We can solve this easily using 2 steps.Assumption each post has just single hashtag
Step 1: Create a new column with Hashtag
df['hashtag']= df.Post.str.extractall(r'(\#\w+)')[0].reset_index()[0]
Step 2: Group by and get the counts
df.groupby([ 'Platform']).hashtag.count()
Generic Solutions Works for any number of hashtag
We can solve this easily using 2 steps.
# extract all hashtag
df1 = df.Post.str.extractall(r'(\#\w+)')[0].reset_index()
# Ste index as index of original tagle where hash tag came from
df1.set_index('level_0',inplace = True)
df1.rename(columns={0:'hashtag'},inplace = True)
df2 = pd.merge(df,df1,right_index = True, left_index = True)
df2.groupby([ 'Platform']).hashtag.count()
This question already has answers here:
Printing Lists as Tabular Data
(20 answers)
Closed 3 years ago.
I want to make a table in python
+----------------------------------+--------------------------+
| name | rank |
+----------------------------------+--------------------------+
| {} | [] |
+----------------------------------+--------------------------+
| {} | [] |
+----------------------------------+--------------------------+
But the problem is that I want to first load a text file that should contain domains name and then I would like to making a get request to each domain one by one and then print website name and status code in table format and table should be perfectly align. I have completed some code but failed to display output in a table format that should be in perfectly align as you can see in above table format.
Here is my code
f = open('sub.txt', 'r')
for i in f:
try:
x = requests.get('http://'+i)
code = str(x.status_code)
#Now here I want to display `code` and `i` variables in table format
except:
pass
In above code I want to display code and i variables in table format as I showed in above table.
Thank you
You can achieve this using the center() method of string. It creates and returns a new string that is padded with the specified character.
Example,
f = ['AAA','BBBBB','CCCCCC']
codes = [401,402,105]
col_width = 40
print("+"+"-"*col_width+"+"+"-"*col_width+"+")
print("|"+"Name".center(col_width)+"|"+"Rank".center(col_width)+"|")
print("+"+"-"*col_width+"+"+"-"*col_width+"+")
for i in range(len(f)):
_f = f[i]
code = str(codes[i])
print("|"+code.center(col_width)+"|"+_f.center(col_width)+"|")
print("+"+"-"*col_width+"+"+"-"*col_width+"+")
Output
+----------------------------------------+----------------------------------------+
| Name | Rank |
+----------------------------------------+----------------------------------------+
| 401 | AAA |
+----------------------------------------+----------------------------------------+
| 402 | BBBBB |
+----------------------------------------+----------------------------------------+
| 105 | CCCCCC |
+----------------------------------------+----------------------------------------+
I have a .txt file of 3 million rows. The file contains data that looks like this:
# RSYNC: 0 1 1 0 512 0
#$SOA 5m localhost. hostmaster.localhost. 1906022338 1h 10m 5d 1s
# random_number_ofspaces_before_this text $TTL 60s
#more random information
:127.0.1.2:https://www.spamhaus.org/query/domain/$
test
:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-0m5tk.com
.0-1-hub.com
.zzzy1129.cn
:127.0.1.4:https://www.spamhaus.org/query/domain/$
.0-il.ml
.005verf-desj.com
.01accesfunds.com
In the above data, there is a code associated with all domains listed beneath it.
I want to turn the above data into a format that can be loaded into a HiveQL/SQL. The HiveQL table should look like:
+--------------------+--------------+-------------+-----------------------------------------------------+
| domain_name | period_count | parsed_code | raw_code |
+--------------------+--------------+-------------+-----------------------------------------------------+
| test | 0 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-0m5tk.com | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-1-hub.com | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .zzzy1129.cn | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-il.ml | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
| .005verf-desj.com | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
| .01accesfunds.com | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
+--------------------+--------------+-------------+-----------------------------------------------------+
Please note that I do not want the vertical bars in any output. They are just to make the above look like a table
I'm guessing that creating a HiveQL table like the above will involve converting the .txt into a .csv or a Pandas data frame. If creating a .csv, then the .csv would probably look like:
domain_name,period_count,parsed_code,raw_code
test,0,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-0m5tk.com,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-1-hub.com,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.zzzy1129.cn,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-il.ml,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
.005verf-desj.com,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
.01accesfunds.com,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
I'd be interested in a Python solution, but lack familiarity with the packages and functions necessary to complete the above data wrangling steps. I'm looking for a complete solution, or code tidbits to construct my own solution. I'm guessing regular expressions will be needed to identify the "category" or "code" line in the raw data. They always start with ":127.0.1." I'd also like to parse the code out to create a parsed_code column, and a period_count column that counts the number of periods in the domain_name string. For testing purposes, please create a .txt of the sample data I have provided at the beginning of this post.
Regardless of how you want to format in the end, I suppose the first step is to separate the domain_name and code. That part is pure python
rows = []
code = None
parsed_code = None
with open('input.txt', 'r') as f:
for line in f:
line = line.rstrip('\n')
if line.startswith(':127'):
code = line
parsed_code = line.split(':')[1]
continue
if line.startswith('#'):
continue
period_count = line.count('.')
rows.append((line,period_count,parsed_code, code))
Just for illustration, you can use pandas to format the data nicely as tables, which might help if you want to pipe this to SQL, but it's not absolutely necessary. Post-processing of strings are also quite straightforward in pandas.
import pandas as pd
df = pd.DataFrame(rows, columns=['domain_name', 'period_count', 'parsed_code', 'raw_code'])
print (df)
prints this:
domain_name period_count parsed_code raw_code
0 test 0 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
1 .0-0m5tk.com 2 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
2 .0-1-hub.com 2 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
3 .zzzy1129.cn 2 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
4 .0-il.ml 2 127.0.1.4 :127.0.1.4:https://www.spamhaus.org/query/doma...
5 .005verf-desj.com 2 127.0.1.4 :127.0.1.4:https://www.spamhaus.org/query/doma...
6 .01accesfunds.com 2 127.0.1.4 :127.0.1.4:https://www.spamhaus.org/query/doma...
You can do all of this with the Python standard library.
HEADER = "domain_name | code"
# Open files
with open("input.txt") as f_in, open("output.txt", "w") as f_out:
# Write header
print(HEADER, file=f_out)
print("-" * len(HEADER), file=f_out)
# Parse file and output in correct format
code = None
for line in f_in:
if line.startswith("#"):
# Ignore comments
continue
if line.endswith("$"):
# Store line as the current "code"
code = line
else:
# Write these domain_name entries into the
# output file separated by ' | '
print(line, code, sep=" | ", file=f_out)
I am playing a little with PrettyTable in Python and I noticed completely different behavior in Python2 and Python 3. Can somebody exactly explain me the difference in output? Nothing in docs gave me satisfied answer for that. But let's start with little code. Let's start with creating my_table
from prettytable import PrettyTable
my_table = PrettyTable()
my_table.field_name = ['A','B']
It creates two column table with column A and column B. Let's add on row to it, but assume that value in cell can have multi lines, separated by Python new line '\n' , as the example some properties of parameter from column A.
row = ['parameter1', 'component: my_component\nname:somename\nmode: magic\ndate: None']
my_table.add_row(row)
Generally the information in row can be anything, it's just a string retrieved from other function. As you can see, it has '\n' inside. The thing that I don't completely understand is the output of print function.
I have in Python2
print(my_table.get_string().encode('utf-8'))
Which have me output like this:
+------------+-------------------------+
| Field 1 | Field 2 |
+------------+-------------------------+
| parameter1 | component: my_component |
| | name:somename |
| | mode: magic |
| | date: None |
+------------+-------------------------+
But in Python3 I have:
+------------+-------------------------+
| Field 1 | Field 2 |
+------------+-------------------------+
| parameter1 | component: my_component |
| | name:somename |
| | mode: magic |
| | date: None |
+------------+-------------------------+
If I completely removes the encode part, it seems that output looks ok on both version of Python.
So when I have
print(my_table.get_string())
It works on Python3 and Python2. Should I remove the encode part from code? It is save to assume it is not necessary? Where is the problem exactly?