Grouping CSV Rows By The Names of Users - python

I have a table in Python with the following data from a CSV:
| subscriberKey | Name | Job |
| ------------- | ---- | --- |
| 123#yahoo.com | Brian | Computer Tech |
| example#gmail.com | Brian | Sales |
| someone#google.com | Gabby | Sales |
| testinge#sendesk.com | Gabby | Marketing |
| sandbox#aol.com | Tyler | Porter |
I want to be able to group the data by the Name and have all of the other cells come with it.
It should end up looking like this:

| subscriberKey | Name | Job |
| ------------- | ---- | --- |
| 123#yahoo.com | Brian | Computer Tech |
| example#gmail.com | Brian | Sales |

| subscriberKey | Name | Job |
| ------------- | ---- | --- |
| someone#google.com | Gabby | Sales |
| testinge#sendesk.com | Gabby | Marketing |

| subscriberKey | Name | Job |
| ------------- | ---- | --- |
| sandbox#aol.com | Tyler | Porter |
Furthermore, I want to create a new CSV file for every table that is created. I have tried to loop through it but have failed too many times. I am currently back at square one and only have the file printing as its normal table. Can anyone help?
import csv

f = open('work.csv')
csv_f = csv.reader(f)
for row in csv_f:
    print(row)

When you are trying to group variables based on a certain key (the name, in this case), a hashmap is usually a good data structure to try.
As a general solution for future readers:
Create an empty dictionary.
Choose the key that you want to group your data by.
Iterate over the data and parse the key and related items.
Add the related items to dict[key].
Now each key in dict will have a list of all the items related to it.
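A minimal sketch of that recipe, using generic sample rows rather than the OP's file:

import collections

rows = [('123@yahoo.com', 'Brian', 'Computer Tech'),
        ('someone@google.com', 'Gabby', 'Sales')]

groups = collections.defaultdict(list)   # 1. the (empty) dictionary
for email, name, job in rows:            # 2-3. iterate and parse out the key
    groups[name].append((email, job))    # 4. add the related items to dict[key]
# 5. groups now maps each name to a list of all its related items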
Tailored more specifically to the OP's question:
import collections

def write_csv(name, lines):
    with open(f"{name}_work.csv", "w") as f:
        for line in lines:
            f.write(','.join(line))
            f.write('\n')

if __name__ == "__main__":
    # LOAD DATA
    with open("work.csv", 'r') as f:
        lines = []
        for line in f.readlines():
            lines.append(line.strip('\n').split(','))

    # GROUP DATA BY NAME INTO A DICTIONARY
    names = collections.defaultdict(list)
    for email, name, job in lines[1:]:
        names[name].append((email, job))

    # WRITE A NEW .csv FILE FOR EACH NAME
    for name in names:
        new_lines = lines[:1]  # keep the header row
        for email, job in names[name]:
            new_lines.append([email, name, job])  # match the header order: subscriberKey, Name, Job
        write_csv(name, new_lines)
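For readers already using pandas, the same split-by-name can be done in a few lines; a minimal sketch, assuming the same work.csv layout as above:

import pandas as pd

df = pd.read_csv('work.csv')
for name, group in df.groupby('Name'):
    group.to_csv(f'{name}_work.csv', index=False)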


Pyspark mapping regex

I have a PySpark dataframe with a text column.
I want to map the values with a regex expression:
df = df.withColumn('mapped_col', regexp_replace('mapped_col', '.*-RH', 'RH'))
df = df.withColumn('mapped_col', regexp_replace('mapped_col', '.*-FI', 'FI'))
I also want to map specific values according to a dictionary. I did the following (mapper is from create_map()):
df = df.withColumn("mapped_col", mapper.getItem(F.col("action")))
Finally, the values that have not been mapped by the dictionary or the regex expressions should be set to null. I do not know how to do this part consistently with the other two.
Is it possible to have something like a dictionary of regex expressions, so I can combine the two 'functions'?
{".*-RH": "RH", ".*FI": "FI"}
Original Output Example
+-----------------------------+
|message |
+-----------------------------+
|GDF2009 |
|GDF2014 |
|ADS-set |
|ADS-set |
|XSQXQXQSDZADAA5454546a45a4-FI|
|dadaccpjpifjpsjfefspolamml-FI|
|dqdazdaapijiejoajojp565656-RH|
|kijipiadoa                   |
+-----------------------------+
Expected Output Example
+-----------------------------+----------+
|message                      |status    |
+-----------------------------+----------+
|GDF2009                      |GDF       |
|GDF2014                      |GDF       |
|ADS-set                      |ADS       |
|ADS-set                      |ADS       |
|XSQXQXQSDZADAA5454546a45a4-FI|FI        |
|dadaccpjpifjpsjfefspolamml-FI|FI        |
|dqdazdaapijiejoajojp565656-RH|RH        |
|kijipiadoa                   |null or ??|
+-----------------------------+----------+
So the first four lines are mapped with the dict, the others are mapped using the regexes, and unmapped values become null (or ??).
Thank you,
You can achieve it using the contains function:
import pyspark.sql.functions as f
from pyspark.sql.types import StringType

df = spark.createDataFrame(
    ["GDF2009", "GDF2014", "ADS-set", "ADS-set", "XSQXQXQSDZADAA5454546a45a4-FI",
     "dadaccpjpifjpsjfefspolamml-FI", "dqdazdaapijiejoajojp565656-RH", "kijipiadoa"],
    StringType()).toDF("message")
df.show()

names = ("GDF", "ADS", "FI", "RH")

def c(col, names):
    # one column per candidate substring: the name if it is contained, else ""
    return [f.when(f.col(col).contains(i), i).otherwise("") for i in names]

df.select("message",
          f.concat_ws("", f.array_remove(f.array(*c("message", names)), "")).alias("status")).show()
output:
+--------------------+
| message|
+--------------------+
| GDF2009|
| GDF2014|
| ADS-set|
| ADS-set|
|XSQXQXQSDZADAA545...|
|dadaccpjpifjpsjfe...|
|dqdazdaapijiejoaj...|
| kijipiadoa|
+--------------------+
+--------------------+------+
| message|status|
+--------------------+------+
| GDF2009| GDF|
| GDF2014| GDF|
| ADS-set| ADS|
| ADS-set| ADS|
|XSQXQXQSDZADAA545...| FI|
|dadaccpjpifjpsjfe...| FI|
|dqdazdaapijiejoaj...| RH|
| kijipiadoa| |
+--------------------+------+
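If you want the dictionary-of-regexes shape from the question, a minimal sketch follows; folding the dict entries into anchored regexes is my assumption, and F.when without an otherwise yields null, so unmatched rows stay null as requested:

import pyspark.sql.functions as F

patterns = {".*-RH$": "RH", ".*-FI$": "FI", "^GDF": "GDF", "^ADS": "ADS"}

# coalesce picks the first pattern that matches; rows matching none stay null
status = F.coalesce(*[F.when(F.col("message").rlike(p), F.lit(v))
                      for p, v in patterns.items()])
df = df.withColumn("status", status)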

Pandas not displaying all columns when writing to CSV

I am attempting to export a dataset that looks like this:
+----------------+--------------+--------------+--------------+
| Province_State | Admin2 | 03/28/2020 | 03/29/2020 |
+----------------+--------------+--------------+--------------+
| South Dakota | Aurora | 1 | 2 |
| South Dakota | Beedle | 1 | 3 |
+----------------+--------------+--------------+--------------+
However, the actual CSV file I am getting looks like this:
+-----------------+--------------+--------------+
| Province_State | 03/28/2020 | 03/29/2020 |
+-----------------+--------------+--------------+
| South Dakota | 1 | 2 |
| South Dakota | 1 | 3 |
+-----------------+--------------+--------------+
Using this code (runnable via createCSV(); it pulls data from the COVID government GitHub):
import pandas as pd  # csv parsing
import requests      # retrieves URL from gov data

def getFile():
    url = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/'
           'master/csse_covid_19_data/csse_covid_19_time_series/'
           'time_series_covid19_deaths_US.csv')
    response = requests.get(url)
    print('Writing file...')
    open('us_deaths.csv', 'wb').write(response.content)

# takes raw data from the link, creates a CSV for each unique state and removes unneeded headings
def createCSV():
    getFile()
    # init data
    data = pd.read_csv('us_deaths.csv', delimiter=',')
    # drop extra columns
    data.drop(['UID'], axis=1, inplace=True)
    data.drop(['iso2'], axis=1, inplace=True)
    data.drop(['iso3'], axis=1, inplace=True)
    data.drop(['code3'], axis=1, inplace=True)
    data.drop(['FIPS'], axis=1, inplace=True)
    #data.drop(['Admin2'], axis=1, inplace=True)
    data.drop(['Country_Region'], axis=1, inplace=True)
    data.drop(['Lat'], axis=1, inplace=True)
    data.drop(['Long_'], axis=1, inplace=True)
    data.drop(['Combined_Key'], axis=1, inplace=True)
    #data.drop(['Province_State'], axis=1, inplace=True)
    data.to_csv('DEBUGDATA2.csv')
    # sets Province_State as primary key; searches based on date and key to create new CSVs in the root directory of the Python app
    data = data.set_index('Province_State')
    data = data.iloc[:, 2:].rename(columns=pd.to_datetime, errors='ignore')
    for name, g in data.groupby(level='Province_State'):
        g[pd.date_range('03/23/2020', '03/29/20')] \
            .to_csv('{0}_confirmed_deaths.csv'.format(name))
The reason for the loop is to convert the date columns (everything after the first two) to dates, so that I can select only 03/23/2020 and beyond. If anyone has a better method of doing this, I would love to know.
To ensure it works, it prints out all the field names, including Admin2 (county name), Province_State, and the rest of the dates.
However, in my CSV, as you can see, Admin2 seems to have disappeared. I am not sure how to make this work; if anyone has any ideas, that'd be great!
Changed
data = data.set_index('Province_State')
to
data = data.set_index(['Province_State', 'Admin2'])
Needed to create a multi-column index to allow the Admin2 column to show. Any smoother tips on the date-range section are welcome; see the sketch below.
Thanks for the help all!
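One possible cleanup of the date-range step, as a sketch: convert the column labels to Timestamps once, then slice with .loc instead of renaming inside a loop. The drop list repeats the columns removed in createCSV() above; the rest is my assumption about the intended result.

import pandas as pd

data = pd.read_csv('us_deaths.csv')
data = data.drop(columns=['UID', 'iso2', 'iso3', 'code3', 'FIPS',
                          'Country_Region', 'Lat', 'Long_', 'Combined_Key'])
data = data.set_index(['Province_State', 'Admin2'])
data.columns = pd.to_datetime(data.columns)        # date headers -> Timestamps
window = data.loc[:, '2020-03-23':'2020-03-29']    # label-based slice, no loop needed
for state, g in window.groupby(level='Province_State'):
    g.to_csv('{0}_confirmed_deaths.csv'.format(state))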

string manipulation, data wrangling, regex

I have a .txt file of 3 million rows. The file contains data that looks like this:
# RSYNC: 0 1 1 0 512 0
#$SOA 5m localhost. hostmaster.localhost. 1906022338 1h 10m 5d 1s
# random_number_ofspaces_before_this text $TTL 60s
#more random information
:127.0.1.2:https://www.spamhaus.org/query/domain/$
test
:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-0m5tk.com
.0-1-hub.com
.zzzy1129.cn
:127.0.1.4:https://www.spamhaus.org/query/domain/$
.0-il.ml
.005verf-desj.com
.01accesfunds.com
In the above data, there is a code associated with all domains listed beneath it.
I want to turn the above data into a format that can be loaded into a HiveQL/SQL. The HiveQL table should look like:
+--------------------+--------------+-------------+-----------------------------------------------------+
| domain_name | period_count | parsed_code | raw_code |
+--------------------+--------------+-------------+-----------------------------------------------------+
| test | 0 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-0m5tk.com | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-1-hub.com | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .zzzy1129.cn | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-il.ml | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
| .005verf-desj.com | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
| .01accesfunds.com | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
+--------------------+--------------+-------------+-----------------------------------------------------+
Please note that I do not want the vertical bars in any output; they are just to make the above look like a table.
I'm guessing that creating a HiveQL table like the above will involve converting the .txt into a .csv or a Pandas data frame. If creating a .csv, then the .csv would probably look like:
domain_name,period_count,parsed_code,raw_code
test,0,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-0m5tk.com,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-1-hub.com,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.zzzy1129.cn,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-il.ml,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
.005verf-desj.com,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
.01accesfunds.com,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
I'd be interested in a Python solution, but lack familiarity with the packages and functions necessary to complete the above data wrangling steps. I'm looking for a complete solution, or code tidbits to construct my own solution. I'm guessing regular expressions will be needed to identify the "category" or "code" line in the raw data. They always start with ":127.0.1." I'd also like to parse the code out to create a parsed_code column, and a period_count column that counts the number of periods in the domain_name string. For testing purposes, please create a .txt of the sample data I have provided at the beginning of this post.
Regardless of how you want to format it in the end, I suppose the first step is to separate the domain_name and code. That part is pure Python:
rows = []
code = None
parsed_code = None
with open('input.txt', 'r') as f:
    for line in f:
        line = line.rstrip('\n')
        if line.startswith(':127'):
            code = line
            parsed_code = line.split(':')[1]
            continue
        if line.startswith('#'):
            continue
        period_count = line.count('.')
        rows.append((line, period_count, parsed_code, code))
Just for illustration, you can use pandas to format the data nicely as a table, which might help if you want to pipe this to SQL, but it's not absolutely necessary. Post-processing of strings is also quite straightforward in pandas.
import pandas as pd

df = pd.DataFrame(rows, columns=['domain_name', 'period_count', 'parsed_code', 'raw_code'])
print(df)
prints this:
domain_name period_count parsed_code raw_code
0 test 0 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
1 .0-0m5tk.com 2 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
2 .0-1-hub.com 2 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
3 .zzzy1129.cn 2 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
4 .0-il.ml 2 127.0.1.4 :127.0.1.4:https://www.spamhaus.org/query/doma...
5 .005verf-desj.com 2 127.0.1.4 :127.0.1.4:https://www.spamhaus.org/query/doma...
6 .01accesfunds.com 2 127.0.1.4 :127.0.1.4:https://www.spamhaus.org/query/doma...
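To get from there to the .csv layout sketched in the question, a short sketch continuing the snippets above; df.to_csv('output.csv', index=False) from the dataframe would work equally well:

import csv

# `rows` is the list built by the parsing loop above
with open('output.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    writer.writerow(['domain_name', 'period_count', 'parsed_code', 'raw_code'])
    writer.writerows(rows)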
You can do all of this with the Python standard library.
HEADER = "domain_name | code"
# Open files
with open("input.txt") as f_in, open("output.txt", "w") as f_out:
# Write header
print(HEADER, file=f_out)
print("-" * len(HEADER), file=f_out)
# Parse file and output in correct format
code = None
for line in f_in:
if line.startswith("#"):
# Ignore comments
continue
if line.endswith("$"):
# Store line as the current "code"
code = line
else:
# Write these domain_name entries into the
# output file separated by ' | '
print(line, code, sep=" | ", file=f_out)

Merge two CSV columns and match up

I have a CSV with three major columns I need to fuse.
One of them is the name of the product, called "Martial".
One of them is the group name, called "Serial".
The final one is "Related", which matches the Martial with the Serial.
At the moment the CSV will look like the following:
(example, has more fields and different data)
Martial | Serial | Related
ExOne | GroupOne |
ExTwo | GroupOne |
ExThree | GroupOne |
ExFour | GroupTwo |
ExFive | GroupTwo |
ExSix | GroupThree |
I need to match each Martial to the others by the Serial, limited to five and separated by "///".
The example outcome should look like the following:
Martial | Serial | Related
ExOne | GroupOne | ExOne///ExTwo///ExThree
ExTwo | GroupOne | ExOne///ExTwo///ExThree
ExThree | GroupOne | ExOne///ExTwo///ExThree
ExFour | GroupTwo | ExFour///ExFive
ExFive | GroupTwo | ExFour///ExFive
ExSix | GroupThree | ExSix
This is my first attempt at Python, and the code I've tried so far only touches on part of what I described. I'm building the code bit by bit; the first aim is to match the Serial groups and list all Martial items under each, for example:
GroupOne
ExOne
ExTwo
ExThree
GroupTwo
ExFour
ExFive
GroupSix
ExSix
Then from there I can handle the cases and combine by factors (if more than 5, etc.).
import csv
import sys

with open('EGLOINDOORCSV.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    Materials = []
    Serials = []
    for row in readCSV:
        Material = row[0]
        Serial = row[4]
        Materials.append(Material)
        Serials.append(Serial)
        if Serial == Serial:
            print(Serial)
            print(Material, end="///")
            print("\n")
            break
print("Done")
First let's recreate a sample file:
data = '''\
Martial|Serial|Related
ExOne|GroupOne|
ExTwo|GroupOne|
ExThree|GroupOne|
ExFour|GroupTwo|
ExFive|GroupTwo|
ExSix|GroupThree|'''
with open('test.csv', 'w') as f:
    f.write(data)
Now the actual code using pandas (pandas comes with the Anaconda distribution; use pip install pandas to install it without Anaconda):
import pandas as pd

df = pd.read_csv('test.csv', sep='|')
df['Related'] = df['Serial'].map(df.groupby('Serial')['Martial']
                                   .apply(lambda x: '///'.join(x)))
df.to_csv('output.csv', index=False)
Returns:
Martial Serial Related
0 ExOne GroupOne ExOne///ExTwo///ExThree
1 ExTwo GroupOne ExOne///ExTwo///ExThree
2 ExThree GroupOne ExOne///ExTwo///ExThree
3 ExFour GroupTwo ExFour///ExFive
4 ExFive GroupTwo ExFour///ExFive
5 ExSix GroupThree ExSix
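The question also caps Related at five items; one reading of that requirement (an assumption on my part) is to slice each group before joining:

# df as built above; keep at most five Martials per Serial before joining
df['Related'] = df['Serial'].map(df.groupby('Serial')['Martial']
                                   .apply(lambda x: '///'.join(x.iloc[:5])))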
My approach is to read the CSV twice: in the first pass I gather the related information, and in the second I output it:
import csv

# Pass 1: gather related materials
with open('EGLOINDOORCSV.csv') as csvfile:
    reader = csv.reader(csvfile)
    related = {}
    for row in reader:
        material = row[0]
        serial = row[1]
        related.setdefault(serial, set()).add(material)

# print(related)  # for debugging

# Pass 2: print
with open('EGLOINDOORCSV.csv') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        material = row[0]
        serial = row[1]
        print('%s | %s | %s' % (material, serial, '///'.join(sorted(related[serial]))))
Output:
ExOne | GroupOne | ExOne///ExThree///ExTwo
ExTwo | GroupOne | ExOne///ExThree///ExTwo
ExThree | GroupOne | ExOne///ExThree///ExTwo
ExFour | GroupTwo | ExFive///ExFour
ExFive | GroupTwo | ExFive///ExFour
ExSix | GroupThree | ExSix
Notes
I assume your CSV file does not have a header. If it does, you will need to skip it:
reader = csv.reader(csvfile)
next(reader) # Skip the header, then move on
Based on the CSV you supplied, I assigned row[0] to material; please adjust the index numbers to match your file.
About the related dictionary
This dictionary is where I keep the relations; it looks like this:
{
"GroupTwo": set(["ExFour", "ExFive"]),
"GroupOne": set(["ExOne", "ExThree", "ExTwo"]),
"GroupThree": set(["ExSix"])
}
In my code, the statement:
related.setdefault(serial, set()).add(material)
is a shorthand for:
if serial not in related:
    related[serial] = set()
related[serial].add(material)
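The same pattern can also be written with collections.defaultdict, which some readers find clearer; a stylistic alternative, not part of the original answer:

import collections

related = collections.defaultdict(set)
for material, serial in [('ExOne', 'GroupOne'), ('ExFour', 'GroupTwo')]:
    related[serial].add(material)   # no setdefault needed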
This approach uses the built-in itertools, so you don't need to install any extra package. It also shows how to write it in a Pythonic way using a dictionary comprehension and a list comprehension.
Step-by-step approach:
#reading the whole file at once
import csv
with open('EGLOINDOORCSV.csv') as csvfile:
    l = [r for r in csv.reader(csvfile, delimiter=',')][1:]  #skip header

#itertools.groupby requires sorted data; sorting by the second field
key = lambda x: x[1]
l = sorted(l, key=key)

#grouping into an aux dictionary
from itertools import groupby
d = {k: "///".join(x[0] for x in g) for k, g in groupby(l, key)}

#updating the third column from the aux dictionary
for x in l:
    x[2] = d[x[1]]
Et voilà!
#this is the content of l, ready to go back to a new csv
[
    ['ExOne', 'GroupOne', 'ExOne///ExTwo///ExThree'],
    ['ExTwo', 'GroupOne', 'ExOne///ExTwo///ExThree'],
    ['ExThree', 'GroupOne', 'ExOne///ExTwo///ExThree'],
    ['ExSix', 'GroupThree', 'ExSix'],
    ['ExFour', 'GroupTwo', 'ExFour///ExFive'],
    ['ExFive', 'GroupTwo', 'ExFour///ExFive'],
]
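From there, l can be written back out with the csv module; a sketch with an illustrative output file name, restoring the header that was skipped earlier:

import csv

with open('out.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Martial', 'Serial', 'Related'])  # restore the skipped header
    writer.writerows(l)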
Disclaimer: this is a vanilla solution, everything in the standard library, but remember that pandas is your friend for handling data; keep in mind installing it and moving to a pandas solution if you need to manage lots of data.
Raw data
$ cat EGLOINDOORCSV.csv
Martial,Serial,Related
ExOne,GroupOne,
ExTwo,GroupOne,
ExThree,GroupOne,
ExFour,GroupTwo,
ExFive,GroupTwo,
ExSix,GroupThree,

How do I save the header and units of an astropy Table into an ascii file

I'm trying to create an ASCII table with some information in the header, the names and units of the columns, and some data; it should look like this:
# ... Header Info ...
Name | Morphology | ra_u | dec_u | ...
| InNS+B+MOI | HH:MM:SS.SSS | ±DD:MM:SS:SSS| ...
==============| ========== | ============ | ============ | ...
1_Cam_A | I | 04:32:01.845 | +53:54:39.03 ...
10_Lac | I | 22:39:15.679 | +39:03:01.01 ...
...
So far I've tried numpy.savetxt and astropy.ascii.write; numpy won't really solve my problem, and with ascii.write I've been able to get something similar but not quite right:
Name | Morphology | ra_u | dec_u | ...
================== | ========== | ============ | ============ | ...
1_Cam_A | I | 04:32:01.845 | +53:54:39.03 ...
...
I'm using this code:
formato = {'Name': '%-23s', 'Morphology': '%-10s', 'ra_u': '%s', 'dec_u': '%s', ...}
names = ['Name', 'Morphology', 'ra_u', 'dec_u', 'Mag6']
units = ['', 'InNS+B+MOI', 'HH:MM:SS.SSS', '±DD:MM:SS:SSS', ...]
ascii.write(data, output='pb.txt', format='fixed_width_two_line',
            position_char='=', delimiter=' | ', names=names, formats=formato)
So if I print the table in my terminal it looks as it should, except for the header info, but when I save it into a file the units disappear...
Is there any way to include them in the file? Or do I need to save the file and edit it later?
P.S.: I've also tried some other formats such as IPAC for ascii.write; in that case the problem is that it includes a 4th row in the header like '| null | null | ...' and I don't know how to get rid of it...
Thanks for the help.
Regards.
There doesn't appear to be a straightforward way to write out the units of a column in a generic way using astropy.table or astropy.io.ascii. You may want to raise an issue at https://github.com/astropy/astropy/issues with a feature request.
However, there is a pretty simple workaround using the format ascii.ipac:
tbl.write('test.txt', format='ascii.ipac')
with open('test.txt', 'r') as fh:
    output = []
    for ii, line in enumerate(fh):
        if ii not in (1, 3):
            output.append(line)
with open('test.txt', 'w') as fh:
    fh.writelines(output)
which will write out in the IPAC format, then remove the 2nd and 4th lines.
Unless your table absolutely has to be in that format, if you want an ASCII table with more complex metadata for the columns, please consider using the ECSV format.
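For reference, a minimal ECSV sketch, assuming tbl is the Table from the question; ECSV stores column units and other metadata natively and round-trips cleanly:

from astropy.table import Table

tbl.write('test.ecsv', format='ascii.ecsv', overwrite=True)  # units and meta survive
roundtrip = Table.read('test.ecsv', format='ascii.ecsv')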
