I have data that outputs into a csv file as:
url date id hits
a 2017-01-01 123 2
a 2017-01-01 123 2
b 2017-01-01 45 25
c 2017-01-01 123 5
d 2017-01-03 678 1
d 2017-01-03 678 7
and so on, where hits is the number of times the id value appears on a given day per url (i.e. the id number 123 appears 2 times on 2017-01-01 for url "a").
I need to create another column after hits, called "total_hits", that captures the total number of hits per day for a given url, date and id value. So the output would look like this:
url date id hits total_hits
a 2017-01-01 123 2 4
a 2017-01-01 123 2 4
b 2017-01-01 45 25 25
c 2017-01-01 123 5 5
d 2017-01-03 678 1 8
d 2017-01-03 678 7 8
If there are solutions to this without using pandas or numpy, that would be amazing.
Please help! Thanks in advance.
Simple with a standard Python installation:
read & parse the file using a line-by-line read & split
create a collections.defaultdict(int) to count the occurrences of each url/date/id triplet
add the info in an extra column
write back (I chose csv)
like this:
import collections
import csv

d = collections.defaultdict(int)
rows = []

with open("input.csv") as f:
    title = next(f).split()  # read the header row
    for line in f:
        toks = line.split()
        d[toks[0], toks[1], toks[2]] += int(toks[3])  # accumulate hits per (url, date, id)
        rows.append(toks)

# complete data: append each row's total
for row in rows:
    row.append(d[row[0], row[1], row[2]])
title.append("total_hits")

with open("out.csv", "w", newline="") as f:
    cw = csv.writer(f)
    cw.writerow(title)
    cw.writerows(rows)
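The two-pass structure is what makes the duplicate rows work out: the first pass accumulates every (url, date, id) triplet's hits in the defaultdict, and only then does the second pass append the finished totals, so both a/2017-01-01/123 rows receive the full total of 4.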
Here's the output file:
url,date,id,hits,total_hits
a,2017-01-01,123,2,4
a,2017-01-01,123,2,4
b,2017-01-01,45,25,25
c,2017-01-01,123,5,5
d,2017-01-03,678,1,8
d,2017-01-03,678,7,8
I'm still quite new to Python and Pandas, and I need some help with the code I'm working with.
I have a dictionary called df which contains some files and their content in txt format. The key of this dictionary is the filename (date.txt) and the value is its content. Here's what it looks like:
{'02_01_2020': 0
0 1 229017 Cust_1 CUR ...
1 2 629324 Cust_2 CUR ...
2 3 863300 Cust_3 CUR ...
3 4 670338 Cust_4 CUR ...
4 5 987039 Cust_5 CUR ...
5 6 485912 Cust_6 CUR ...,'03_01_2020': 0
0 1 122403 Cust_1 CUR ...
1 2 779269 Cust_2 CUR ...
2 3 728965 Cust_3 CUR ...
3 4 527716 Cust_4 CUR ...
4 5 796179 Cust_5 CUR ...
5 6 027872 Cust_6 CUR ...
6 7 449767 Cust_7 CUR ...
7 8 598752 Cust_8 CUR ...
8 9 180422 Cust_9 CUR ..., .... goes until the last file ('31_01_2020')}
As you can see above, each file contains different data. File 02_01_2020.txt has 6 entries, file 03_01_2020.txt has 9 entries, and so on until the last file (31_01_2020.txt).
My goal here is to separate the necessary information into its own columns (customer name, currency, etc.), with the filename inserted into a separate column called paid_date. I used iterrows() to loop through this dictionary. Here's the code:
def data_process(df):
    # dataframe that I created outside this function
    global df_data_1
    for key, value in df.items():
        df1 = pd.DataFrame(value)
        df1['Paid_date'] = key.replace('_', '/')
        # df1.insert(1, 'Paid_date', key.replace('_', '/'))  # another attempt to insert the col
        for index, row in df1.iterrows():
            df_Item_Num = row.str.slice(start=0, stop=2)   # entry number
            df_DUMP_1 = row.str.slice(start=0, stop=23)    # not used
            df_NAME = row.str.slice(start=23, stop=40)
            df_CURRENCY = row.str.slice(start=40, stop=54)
            df_AMOUNT = row.str.slice(start=54, stop=66)
            df_DATE = row.str.slice(start=68, stop=86)
            df_DUMP_2 = row.str.slice(start=87, stop=-1)   # not used
            df_ALL_ITEMS = pd.concat([df_Item_Num, df_NAME, df_CURRENCY, df_AMOUNT, df_DATE], ignore_index=True)
            df_data_1 = df_data_1.append(df_ALL_ITEMS, ignore_index=True)
    return df_data_1
When I disabled the column creation code where the key is passed, df1['Paid_date'] = key.replace('_', '/'), the result looks like this:
0 1 2 3 4
0 1 Cust_1 CUR Amount Date_Time
1 2 Cust_2 CUR Amount Date_Time
2 3 Cust_3 CUR Amount Date_Time
3 4 Cust_4 CUR Amount Date_Time
4 5 Cust_5 CUR Amount Date_Time
.. .. ... ... ... ...
185 10 Cust_6 CUR Amount Date_Time
186 11 Cust_7 CUR Amount Date_Time
187 12 Cust_8 CUR Amount Date_Time
188 13 Cust_9 CUR Amount Date_Time
189 14 Cust_10 CUR Amount Date_Time
Which is exactly what I need, except that the paid_date column is not included (I need the filename to be stored in each row that corresponds to that particular file, e.g. 02_01_2020 would be written to 6 rows, 03_01_2020 to 9 rows, etc.). However, when I enabled the column creation code, it ended up like this:
0 1 2 3 ... 6 7 8 9
0 1 02 Cust_1 ... Amount Date_Time
1 2 02 Cust_2 ... Amount Date_Time
2 3 02 Cust_3 ... Amount Date_Time
3 4 02 Cust_4 ... Amount Date_Time
4 5 02 Cust_5 ... Amount Date_Time
.. .. .. ... .. ... ... .. ... ..
185 10 31 Cust_6 ... Amount Date_Time
186 11 31 Cust_7 ... Amount Date_Time
187 12 31 Cust_8 ... Amount Date_Time
188 13 31 Cust_9 ... Amount Date_Time
189 14 31 Cust_10 ... Amount Date_Time
I have a couple of new empty columns, and apparently the key (filename) is not fully inserted (only the day is somehow stored in the new column; month and year are not included). What is the most efficient way to fix this? Any help would be greatly appreciated. Thank you.
Edit 1
The entries of each txt file that I'm working with look something like this:
1 CUST_NAME_1 CURRENCY AMOUNT DATE_TIME
2 CUST_NAME_2 CURRENCY AMOUNT DATE_TIME
3 CUST_NAME_3 CURRENCY AMOUNT DATE_TIME
4 CUST_NAME_4 CURRENCY AMOUNT DATE_TIME
5 CUST_NAME_5 CURRENCY AMOUNT DATE_TIME
Inside the txt file there's a lot of white space separating the information, as you can see above. My code first loops through the directory on my computer that stores all the files and appends them to two lists. Here's the code:
# SET UP EMPTY LISTS & DICTIONARY
filelist = []
filename = []
df = {}

def file_process(mydir):
    for path, dirs, files in os.walk(mydir):
        for file in files:
            if file.endswith('.txt'):
                filelist.append(file)
                filename.append(file[0:10])
    return filelist, filename
The above code returns two lists.
filelist contains each txt file (02_01_2020.txt, 03_01_2020.txt, etc)
filename contains only the name of each file (02_01_2020, 03_01_2020, etc)
Then I wrote the following code to convert those two lists into a single dictionary (I will avoid using df as a dictionary name, thanks for the suggestion).
def dict_process(filelist, filename):
    for key in filename:
        for value in filelist:
            df[key] = pd.read_csv(value, sep="delimiter", skiprows=[0, 1, 2, 3, 4, 5, 6, 7, 8],
                                  skipfooter=6, header=None)
            filelist.remove(value)
            break
    return df
And the code above returns the dictionary I described previously, where the filename is the key and the file content is its value.
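(As a side note, the remove()/break pairing in dict_process is effectively a zip over the two lists; a minimal sketch of the same loop:)

def dict_process(filelist, filename):
    # pair each name with its file directly rather than remove()/break
    for key, value in zip(filename, filelist):
        df[key] = pd.read_csv(value, sep="delimiter", skiprows=list(range(9)),
                              skipfooter=6, header=None)
    return df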
What I did (or thought I did) within for index,row in df1.iterrows(): was to slice every series that iterrows() returned, keep only the information I want, and concatenate it into an empty dataframe. Is this efficient, or is there another way?
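One alternative worth noting: since the fields are cut at fixed character positions, pandas.read_fwf could parse each file directly and avoid iterrows() altogether. A minimal sketch, where the colspecs simply mirror the slice positions above and may need adjusting to the real files:

import pandas as pd

# colspecs mirror the str.slice positions used in data_process (assumed, not verified)
colspecs = [(0, 2), (23, 40), (40, 54), (54, 66), (68, 86)]
names = ['item_num', 'name', 'currency', 'amount', 'date_time']
df1 = pd.read_fwf('02_01_2020.txt', colspecs=colspecs, names=names,
                  skiprows=9, skipfooter=6)
df1['Paid_date'] = '02/01/2020'  # filename with underscores replaced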
I have a dataframe of daily license_type activations (either full or trial), as shown below. Basically, I am trying to see the monthly count of trial-to-full license conversions. I am trying to do this by taking into consideration the daily data and the User_Email column.
Date User_Email License_Type P.Letter Month (conversions)
0 2017-01-01 10431046623214402832 trial d 2017-01
1 2017-07-09 246853380240772174 trial b 2017-07
2 2017-07-07 13685844038024265672 trial e 2017-07
3 2017-02-12 2475366081966194134 full c 2017-02
4 2017-04-08 761179767639020420 full g 2017-04
The logic I have is to iteratively check the User_Email column. If the User_Email value is a duplicate, then check the License_Type column: if the value is 'full', return 1 in a new column called 'Conversion', else return 0. This would be the amendment to the original dataframe above.
Then group the 'Date' column by month, and I should have an aggregate value of monthly conversions in the 'Conversion' column. It should look something like below:
Date
2017-Apr 1
2017-Feb 2
2017-Jan 1
2017-Jul 0
2017-Mar 1
Name: Conversion
Below was my attempt at getting the desired output above:
# attempt to create a new column Conversion and fill it with 1 or 0 depending on whether the user converted
for value in df['User_email']:
    if value.is_unique:
        df['Conversion'] = 0  # because there is no chance to go from trial to Full
    else:
        if df['License_type'] == 'full':  # check if license type is full
            df['Conversion'] = 1  # if full, I assume it was originally trial and now is full

# Grouping daily data by month to get monthly total of conversions
converted = df.groupby(df['Date'].dt.strftime('%Y-%b'))['Conversion'].sum()
Your sample data doesn't have the features you say you are looking for. Rather than loop (always a pandas anti-pattern), use a simple function that operates row by row:
For the uniqueness test, I first get a count of uses of each email address and set the number of occurrences on each row.
I've transcribed your logic in a slightly different way.
data = """ Date User_Email License_Type P.Letter Month
0 2017-01-01 10431046623214402832 trial d 2017-01
1 2017-07-09 246853380240772174 trial b 2017-07
2 2017-07-07 13685844038024265672 trial e 2017-07
3 2017-02-12 2475366081966194134 full c 2017-02
3 2017-03-13 2475366081966194134 full c 2017-03
3 2017-03-13 2475366081966194 full c 2017-03
4 2017-04-08 761179767639020420 full g 2017-04"""
a = [[t.strip() for t in re.split(" ",l) if t.strip()!=""] for l in [re.sub("([0-9]?[ ])*(.*)", r"\2", l) for l in data.split("\n")]]
df = pd.DataFrame(a[1:], columns=a[0])
df["Date"] = pd.to_datetime(df["Date"])
df = df.assign(
    emailc=df.groupby("User_Email")["User_Email"].transform("count"),
    Conversion=lambda dfa: dfa.apply(lambda r: 0 if r["emailc"] == 1 or r["License_Type"] == "trial" else 1, axis=1)
).drop("emailc", axis=1)
df.groupby(df['Date'].dt.strftime('%Y-%b'))['Conversion'].sum()
Output:
Date
2017-Apr 0
2017-Feb 1
2017-Jan 0
2017-Jul 0
2017-Mar 1
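As a follow-up, since the no-loop point above applies to apply() as well, the same logic can be written fully vectorized. A minimal sketch, reusing the frame built above:

# equivalent vectorized form (sketch): 1 only when the email repeats and the license is full
df["Conversion"] = (
    (df.groupby("User_Email")["User_Email"].transform("count") > 1)
    & (df["License_Type"] == "full")
).astype(int)
df.groupby(df["Date"].dt.strftime("%Y-%b"))["Conversion"].sum()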
Please suggest a more suitable title for this question.
I have a two-level indexed DF (created via groupby):
clicks yield
country report_date
AD 2016-08-06 1 31
2016-12-01 1 0
AE 2016-10-11 1 0
2016-10-13 2 0
I need:
To take the data country by country, process it, and put it back:

for country in set(DF.get_level_values(0)):
    DF_country = process(DF.loc[country])
    DF[country] = DF_country

Where process adds new rows to DF_country.
The problem is in the last line:
ValueError: Wrong number of items passed 2, placement implies 1
I just modified your code, changing process to add. Based on my understanding, process is a self-defined function, right?

for country in set(DF.index.get_level_values(0)):  # change here
    DF_country = DF.loc[country].add(1)
    DF.loc[country] = DF_country.values  # and here
DF
Out[886]:
clicks yield
country report_date
AD 2016-08-06 2 32
2016-12-01 2 1
AE 2016-10-11 2 1
2016-10-13 3 1
EDIT:

l = []
for country in set(DF.index.get_level_values(0)):
    DF1 = DF.loc[country]
    DF1.loc['2016-01-01'] = [1, 2]  # adding row here
    l.append(DF1)

pd.concat(l, axis=0, keys=set(DF.index.get_level_values(0)))
Out[923]:
clicks yield
report_date
AE 2016-10-11 1 0
2016-10-13 2 0
2016-01-01 1 2
AD 2016-08-06 1 31
2016-12-01 1 0
2016-01-01 1 2
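One small robustness note: building l from one set(...) and passing a second set(...) as keys relies on both iterating in the same order. A sorted list makes the pairing explicit; a minimal sketch of the same EDIT:

countries = sorted(set(DF.index.get_level_values(0)))
l = []
for country in countries:
    DF1 = DF.loc[country]
    DF1.loc['2016-01-01'] = [1, 2]  # adding row here
    l.append(DF1)
pd.concat(l, axis=0, keys=countries)  # keys now provably match l's order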
I have several CSV files containing measurements from several sensors
s1.CSV:
date;hour;source;values
01/25/12;10:20:00;a; 88 -84 27
01/25/12;10:30:00;a; -80
01/25/12;10:50:00;b; -96 3 -88
01/25/12;09:00:00;b; -97 101
01/25/12;09:10:00;c; 28
s2.CSV:
date;hour;source;values
01/25/12;10:20:00;a; 133
01/25/12;10:25:00;a; -8 -5
I'd like to create one CSV per source (a/b/c), with each sensor's measurements in a separate column, sorted by date and hour.
a.CSV:
date;hour;source;s1;s2
01/25/12;10:20:00;a; 88 -84 27; 133
01/25/12;10:25:00;a; ; -8 -5
01/25/12;10:30:00;a; -80;
...
I'm stuck here:
import glob
import csv
import os

os.system('cls')

sources = dict()
sensor = 0
filelist = glob.glob("*.csv")
for f in filelist:
    reader = csv.DictReader(open(f), delimiter=";")
    for row in reader:
        # date = row['date']  # date later
        hour = row['hour']
        val = row['values']
        source = row['source']
        if not sources.has_key(source):  # new source
            sources[source] = list()
        sources[source].append({'hour': hour, 'sensor' + `sensor`: val})
    sensor += 1
I'm not sure this data structure is well suited for sorting, and I also feel like I'm repeating the column names.
Using the data you provided, I cooked up something using Pandas. Please see the code below.
The output, granted, is non-ideal, as the hour and source get repeated within a column. As I am learning too, I'd also welcome any expert input on whether Pandas can do what the OP is asking for!
In [1]: import pandas as pd
In [2]: s1 = pd.read_csv('s1.csv', delimiter=';', parse_dates=True)
In [3]: s1
Out[3]:
date hour source values
0 01/25/12 10:20:00 a 88 -84 27
1 01/25/12 10:30:00 a -80
2 01/25/12 10:50:00 b -96 3 -88
3 01/25/12 09:00:00 b -97 101
4 01/25/12 09:10:00 c 28
In [4]: s2 = pd.read_csv('s2.csv', delimiter=';', parse_dates=True)
In [5]: s2
Out[5]:
date hour source values
0 01/25/12 10:20:00 a 133
1 01/25/12 10:25:00 a -8 -5
In [6]: joined = s1.append(s2)
In [7]: joined
Out[7]:
date hour source values
0 01/25/12 10:20:00 a 88 -84 27
1 01/25/12 10:30:00 a -80
2 01/25/12 10:50:00 b -96 3 -88
3 01/25/12 09:00:00 b -97 101
4 01/25/12 09:10:00 c 28
0 01/25/12 10:20:00 a 133
1 01/25/12 10:25:00 a -8 -5
In [8]: grouped = joined.groupby('hour').sum()
In [9]: grouped.to_csv('a.csv')
In [10]: grouped
Out[10]:
date source values
hour
09:00:00 01/25/12 b -97 101
09:10:00 01/25/12 c 28
10:20:00 01/25/1201/25/12 aa 88 -84 27 133
10:25:00 01/25/12 a -8 -5
10:30:00 01/25/12 a -80
10:50:00 01/25/12 b -96 3 -88
If I understand correctly, you have multiple files, each corresponding to a given "sensor", with the identity of the sensor in the filename. You want to read the files, then write them out into separate files again, this time divided by "source", with the data from the different sensors combined into several final rows.
Here's what I think you want to do:
Read the data in, and build a nested dictionary data structure, as follows:
The top level key would be the source (e.g. 'a').
The second level will be keyed by a (date, time) tuple.
The inner most level will be keyed by sensor, taken from the filename, and have the actual sensor readings as values.
You'd also want to keep track of all the sensors that have been seen.
To write the data out, you'd loop over the items of the outermost dictionary, creating a new output file for each one.
The rows of each file would be determined by sorting the keys of the next dictionary.
The last value of each row would be formed by concatenating the values of the innermost dict, filling in an empty string for any missing values.
Here's some code:
from collections import defaultdict
from datetime import datetime
import csv
import glob
import os

# data structure is data[source][date, time][sensor] = value, with "" as default value
data = defaultdict(lambda: defaultdict(lambda: defaultdict(str)))
sensors = []
filelist = glob.glob("*.csv")

# read old files
for fn in filelist:
    sensor = os.path.splitext(fn)[0]
    sensors.append(sensor)
    with open(fn, 'rb') as f:
        reader = csv.DictReader(f, delimiter=";")
        for row in reader:
            date = datetime.strptime(row['date'], '%m/%d/%y')
            data[row['source']][date, row['hour']][sensor] = row['values']

sensors.sort()  # note, this may not give the best sort order
header = ['date', 'hour', 'source'] + sensors

for source, source_data in data.iteritems():
    fn = "{}.csv".format(source)
    with open(fn, 'wb') as f:
        writer = csv.writer(f, delimiter=";")
        writer.writerow(header)
        for (date, time), hour_data in sorted(source_data.items()):
            values = [hour_data[sensor] for sensor in sensors]
            writer.writerow([date.strftime('%m/%d/%y'), time, source] + values)
I only convert the date field to an internal type because otherwise sorting based on dates won't work correctly (dates in January 2013 would appear before those in February 2012). In the future, consider using ISO 8601 style date formatting, YYYY-MM-DD, which can be safely sorted as a string. The rest of the values are handled only as strings, with no interpretation.
The code assumes that the sensor values can be ordered lexicographically. This is likely if you only have a few of them, e.g. s1 and s2. However, if you have a s10, it will be sorted ahead of s2. To solve this you'll need a "natural" sort, which is more complicated than I can solve here (but see this recent question for more info).
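For reference, a minimal natural-sort key might look like the sketch below; it splits out runs of digits so they compare as integers, making s2 sort before s10:

import re

def natural_key(s):
    # split "s10" into ['s', 10, ''] so numeric chunks compare numerically
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r'(\d+)', s)]

sensors.sort(key=natural_key)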
One final warning: this solution may do bad things if you run it multiple times in the same folder. That's because the output files, e.g. a.csv, will be seen by glob.glob('*.csv') as input files when you run it again.
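One hedged workaround, assuming the sensor files all follow the sN.csv naming used here, is to filter the glob so the generated per-source files are never treated as inputs:

import glob
import os
import re

# only accept names like s1.csv, s2.csv, ... as sensor inputs;
# generated outputs such as a.csv or b.csv are skipped
filelist = [fn for fn in glob.glob("*.csv")
            if re.match(r"s\d+$", os.path.splitext(os.path.basename(fn))[0], re.IGNORECASE)]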
I am trying to create a csv download, but the resulting download comes out in a different format.
def csv_download(request):
    import csv
    import calendar
    from datetime import *
    from dateutil.relativedelta import relativedelta

    now = datetime.today()
    month = datetime.today().month
    d = calendar.mdays[month]
    # Create the HttpResponse object with the appropriate CSV header.
    response = HttpResponse(mimetype='text/csv')
    response['Content-Disposition'] = 'attachment; filename=somefilename.csv'
    m = Product.objects.filter(product_sellar='jhon')
    writer = csv.writer(response)
    writer.writerow(['S.No'])
    writer.writerow(['product_name'])
    writer.writerow(['product_buyer'])
    for i in xrange(1, d):
        writer.writerow(str(i) + "\t")
    for f in m:
        writer.writerow([f.product_name, f.porudct_buyer])
    return response
Output of the above code:
product_name
1
2
4
5
6
7
8
9
1|10
1|1
1|2
.
.
.
2|7
mgm | x_name
wge | y_name
I am looking for output like this:
s.no porduct_name product_buyser 1 2 3 4 5 6 7 8 9 10 .....27 total
1 mgm x_name 2 3 8 13
2 wge y_name 4 9 13
Can you please help me with the above csv download?
If possible, can you also tell me how to sum up each individual seller's total at the end?
Example:
We have a selling table into which seller info is inserted every day.
The table data looks like:
S.no product_name product_seller sold Date
1 paint jhon 5 2011-03-01
2 paint simth 6 2011-03-02
I have created a table that displays the format below, and I am trying to create a csv download:
s.no prod_name prod_sellar 1-03-2011 2-03-2011 3-03-2011 4-03-2011 total
1 paint john 10 15 0 0 25
2 paint smith 2 6 2 0 10
Please read the csv module documentation, particularly the writer object API.
You'll notice that the csv.writer object takes a list with elements representing their position in your delimited line. So to get the desired output, you would need to pass in a list like so:
writer = csv.writer(response)
writer.writerow(['S.No', 'product_name', 'product_buyer'] + range(1, d) + ['total'])
This will give you your desired header output.
You might want to explore the csv.DictWriter class if you only want to populate some parts of the row. It's much cleaner. This is how you would do it:

writer = csv.DictWriter(response,
                        ['S.No', 'product_name', 'product_buyer'] + range(1, d) + ['total'])
writer.writeheader()  # writes the header row (csv.DictWriter has this since Python 2.7)

Then your write command would follow like so:

for f in m:
    writer.writerow({'product_name': f.product_name, 'product_buyer': f.product_buyer})
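As for the follow-up question about summing each seller's total at the end: one hedged sketch, assuming you first aggregate the selling table into a per-day mapping (daily_sold below is hypothetical and would be built from your own query):

for f in m:
    row = {'S.No': f.pk, 'product_name': f.product_name, 'product_buyer': f.product_buyer}
    total = 0
    for day in range(1, d):
        sold = daily_sold.get((f.product_name, day), 0)  # hypothetical {(product, day): sold} mapping
        row[day] = sold
        total += sold
    row['total'] = total
    writer.writerow(row)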