← 2020-03-03 | 2020-03-05 →
09:48 whaack jfw: hm yup I received that, I will pay more attention to the =freenode tab going forward
~ 1 hour 15 minutes ~
11:03 feedbot http://ossasepia.com/2020/03/04/no-bones-in-thy-skeleton-and-no-theory-in-thy-research/ << Ossa Sepia -- No Bones in Thy Skeleton and No Theory in Thy Research
~ 3 hours 9 minutes ~
14:13 diana_coman jfw: why u no write? It's been a whole week!!1
14:15 diana_coman BingoBoingo: where are you with the scripts? kind of lost track of that part and saw only the drafts.
~ 51 minutes ~
15:06 BingoBoingo diana_coman: Feeding urls from a file to curl using command substitution appears to fit in the hand. I'll clean up the pieces I have and get them in here.
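[A sketch of the "feeding urls from a file to curl using command substitution" approach BingoBoingo mentions; the file names are hypothetical, and the loop form is the safer variant when urls might contain characters the shell would split on:]

```shell
# urls.txt: one url per line (hypothetical file name).
# Command substitution expands the file's contents into curl's argument list:
curl -sL $(cat urls.txt) > pages.html

# A read loop avoids word-splitting surprises and lets each url
# be saved to its own file (slashes and colons mangled to underscores):
while IFS= read -r url; do
    curl -sL "$url" -o "$(echo "$url" | tr '/:' '__').html"
done < urls.txt
```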
15:14 jfw diana_coman: because I'm ffa'ing it apparently, "can't possibly cut elephant into more manageable bites". Published nao.
15:19 diana_coman jfw: ahaha, "eat your elephants in small pieces!"
15:23 diana_coman BingoBoingo: more to the point: what steps do you have working, what did you obtain already with them, what's the next step and where are you with that ?
15:23 BingoBoingo http://paste.deedbot.org/?id=x7wl << The website discovery pieces. I've got a start to a filter for "Things with these file extensions aren't interesting" and a start to a "Does this page have a comment box" tester.
15:24 BingoBoingo Now that gathering works, the next step is cutting out the gathered items that aren't interesting.
15:24 diana_coman BingoBoingo: did you run those on anything? on what? what did you get out of it? where do you run them next?
15:27 diana_coman jfw: so you have in that very article some questions re the signatures thread - why didn't you ask those in #t?
15:30 BingoBoingo diana_coman: I've run them starting from a few different sites. I get at the end a file 'churndomains4' full of website urls.
15:30 BingoBoingo diana_coman: Since running out that many iterations gets very slow, for now I'm testing the filtering on the 'churn3' list of all urls collected from a bunch of discovered sites.
15:30 BingoBoingo This is the most recent 'churn3' I've produced http://paste.deedbot.org/?id=VxhL
15:32 BingoBoingo Here's the most recent (and smaller) churndomains4 http://paste.deedbot.org/?id=bKiT
15:33 diana_coman BingoBoingo: uhm, I don't quite get it - are you after the sites or after all pages of a site? (and even ...images??)
15:35 jfw diana_coman: they were pretty vague in my mind until spelling it all out now. Perhaps even still now, dunno; do they make sense to you?
15:35 diana_coman BingoBoingo: to my mind the initial exploration aims to get literally as many domains as you can reach starting from a given point; so yes, it follows links from there but you don't really need to save other than those that point to *another* domain, do you?
15:36 diana_coman BingoBoingo: what's though the core trouble you are having with this because it seems to me quite obviously going beyond curl/awk/sed/whatever command line ie you just don't see it as clear or specific enough steps at all, can't quite put my finger on it.
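[The discovery stage diana_coman describes — keep only links pointing to *another* domain — could be sketched as below; the seed domain and file names are assumptions, and the href extraction is a rough heuristic, not a proper HTML parser:]

```shell
# Assumes the page was already fetched, e.g. curl -sL "$SEED" > page.html
SEED_DOMAIN="example.com"

# Pull href targets out of the HTML, keep only the domain part,
# then drop anything still on the seed's own domain:
grep -o 'href="http[^"]*"' page.html \
  | sed -e 's/^href="//' -e 's/"$//' \
  | sed -e 's|^https\?://||' -e 's|/.*||' \
  | grep -v "^$SEED_DOMAIN\$" \
  | sort -u > newdomains.txt
```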
15:38 diana_coman jfw: well, your article there is quite highly strung and rather visibly the result of pain-writing; but the way it looks it's quite as you say in footnote 1 - you torture the writing because it's not as definitive as you'd want it to be, huh.
15:41 diana_coman jfw: the thing with questions though is that they are precisely exploratory - it's true that at times you can indeed ask questions to help the other party explore but not *all* questions are like that, lol; at times you literally ask to figure stuff out so yes, necessarily *before* things are clear, lol
15:42 diana_coman jfw: specifically on the questions in footnote iv, the second one assumes the whiteout - it's unclear that is the desired approach to start with so maybe ask *that*? ie how would it work, maybe whiteout or something else/what?
15:44 diana_coman the first one seems quite clear ie the underlying concern is that including signatures in the same place as the vpatch/text requires some clear separation of the roles of those 2 bunches of (ultimately) text; so how is that to be achieved?
15:44 diana_coman jfw: is that what you are asking there?
15:44 jfw so aiming too far even with the questions, hm.
15:45 diana_coman jfw: what do you mean by "too far"?
15:46 jfw trying to cover too much ground and possibly introducing bad assumptions rather than starting with something simpler
15:47 jfw yes, the boundary between sigs and text is the root of it
15:48 BingoBoingo diana_coman: It's not the most elegant approach, but I'll try rearranging and presenting
15:48 BingoBoingo diana_coman: On the first couple rounds I'm after new sites. On the last round I'm after blogposts specifically. The thing I'm chewing on now is cutting the uninteresting stuff out of the file full of urls to images and everything else without stripping it down to the bare domains.
15:48 BingoBoingo diana_coman: As this works now, it curls one site and puts all the urls in a file; the next step produces from that a smaller file of only new site urls; the third step curls the sites, creating a large file of all encountered urls; the fourth step trims it down to sites...
15:48 BingoBoingo diana_coman: So where I want to go is from an "all urls file" to "urls scrubbed of images, .js, .css, etc", from there retrieve urls and screen for comment boxes in the next cut.
15:49 diana_coman jfw: well, you probably have way more practice figuring things out on your own than through discussion, don't you?
15:49 BingoBoingo diana_coman: In between "scrub images etc" and "retrieve urls looking for comment boxes", I'm uncertain if I want to add a "cut the list to 3 or 4" urls per site step.
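[The two cuts BingoBoingo is after — scrubbing uninteresting extensions from the url list, then probing the survivors for a comment box — might look like this; the file names follow the ones in the log ('churn3'), the extension list is an assumption, and the textarea check is a rough heuristic rather than a guarantee:]

```shell
# Step 1: drop urls pointing at images, scripts, styles and the like.
grep -Ev '\.(jpe?g|png|gif|ico|js|css|pdf)(\?.*)?$' churn3 > churn3.pages

# Step 2: fetch each remaining url and look for a comment form;
# a <textarea> is a crude but serviceable tell.
while IFS= read -r url; do
    if curl -sL "$url" | grep -qi '<textarea'; then
        echo "$url" >> commentable.txt
    fi
done < churn3.pages
```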
15:49 jfw diana_coman: yep
15:50 diana_coman jfw: that's pretty much the underlying cause really - in other words simply lack of practice.
15:51 diana_coman and it's quite possibly further coming from the fact that yeah, not much to get from asking questions of the clueless and so on, to the full context; but the solution is still...practice.
15:52 jfw makes sense.
15:53 diana_coman BingoBoingo: it's not about elegant or anything of the sort; but to start with, a program executes a series of steps itself, it doesn't have to be one step one script; the point and my repeated asking for your "steps" is to figure out what you are trying to achieve at one *stage* if you prefer; ie stage 1: discovery of linked domains starting from a given domain; 2. finding all pages with a comment box for a given domain
15:54 diana_coman BingoBoingo: basically you have a big problem to solve; you'll have to cut this into smaller problems so you can solve them; if needed, you cut and cut again (divide and conquer , pretty much)
15:55 diana_coman then once you have one small-enough problem, that you *know* how to solve *manually*, you simply take those manual steps and tell the machine to do them.
15:57 diana_coman http://logs.ossasepia.com/log/ossasepia/2020-03-04#1020036 - heh, now I suspect you've been reading the #e logs of today, lol
15:57 ossabot Logged on 2020-03-04 17:03:48 jfw: trying to cover too much ground and possibly introducing bad assumptions rather than starting with something simpler
15:58 jfw I haven't actually
15:59 diana_coman jfw: you know, one of the good things in academia is that you *have to* ask questions; as in, if you listen to a presentation, whatever it might be, on whatever topic and regardless of how well or badly made, at the end you *have to ask* at least x questions; that's practice, pure and simple and it...works.
16:00 diana_coman looking back at it (as I was initially rubbish at this part), I think initially I simply studied other people's questions to figure out how they managed it, lolz
16:00 diana_coman http://logs.ossasepia.com/log/ossasepia/2020-03-04#1020054 - then even more well done you!
16:00 ossabot Logged on 2020-03-04 17:15:06 jfw: I haven't actually
16:01 BingoBoingo diana_coman: Thank you. I'll get to breaking these problems up some more.
16:01 diana_coman (today's #e log is not directly on question asking but it is on exploring what is pretty much a big unknown and it touches at times on what makes for a better initial exploration precisely on the grounds you gave re possibly introducing bad assumptions if not simple enough)
16:02 diana_coman BingoBoingo: yw; is it clear to you what & how there? because I really don't want that it blocks you even more somehow.
16:03 jfw diana_coman: interesting, I hadn't heard about the mandatory questions. Re #e, perhaps it's that you brought the notion through your feedback, and I attempted to expand.
16:05 diana_coman might be.
16:05 diana_coman jfw: since you have presentations at your Junto meetings for that matter, do you have questions at the end?
16:06 jfw heh, sometimes we have to tamp down on questions popping up throughout so as to get to the end
16:07 diana_coman jfw: ahaha, that's good then; is it *you* asking questions though? :P
16:07 BingoBoingo diana_coman: The whats seem clear. The hows less so, but enough to get moving.
16:07 diana_coman BingoBoingo: alright then.
16:08 jfw diana_coman: sometimes; though hm, possibly less on the more unfamiliar topics.
16:10 jfw mandatory questions afterward sounds like a great addition actually.
16:10 diana_coman jfw: in principle there's nothing wrong with just agreeing to keep questions for the end (as some of them might be answered at times simply at a later point in the presentation) and otherwise set mandatory questions at the end, yeah
16:22 jfw so I wasn't sure what "high strung" meant, my guess was something in the vein of pretentious or stuffy or bombastic (not that those are all that similar), but I'm reading it's more in the vein of nervous or tense, which certainly seems to fit better here. Is that right diana_coman?
16:24 jfw and that'd be another example of where I coulda figured out by asking earlier!
16:27 * jfw afk, food
~ 15 minutes ~
16:42 diana_coman jfw: ah, not at all stuffy/bombastic/pretentious, no; and not nervous either; and note that I use adverbs correctly, it's highly (not "high") strung for a reason! if you think of how you tighten/loosen up strings on a guitar, that's pretty much the analogy there - you kept stretching and tuning and fiddling with it until the result is a highly strung (and generally too tightly but not only that) text/string.
16:43 * diana_coman will be back tomorrow.
~ 1 hour 27 minutes ~
18:11 whaack when I run top on one of my vms, I get "Mem: 3922344k total, 1768028k used, 2154316k free, 143744k buffers" for the line that describes memory usage. When I inspect how much memory an individual process is using on the same vm with the command pmap, I get "total 3245868K" for the last line. Why would pmap report more memory being used by one process than top reports for all processes?
~ 16 minutes ~
18:27 jfw whaack: do you know how virtual memory works?
18:30 whaack jfw: No, I do not
~ 55 minutes ~
19:25 jfw whaack: sorry 'bout the delay, got my attentions diverted. It's worth learning about (what, they didn't have any comp arch class at that MIT?!) but the short version is each process has its own address space, portions of which get mapped to different things such as physical RAM, files, hardware registers and such by the OS and CPU (MMU specifically).
19:27 jfw so what you're looking at with pmap is the total mappings, many of which may be shared with other processes, not actually allocated due to overcommit, and so on.
19:28 jfw The RES line in top or ps listings tends to be the closest approximation of actual usage attributable to the process in my understanding.
19:28 jfw (resident set size)
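[The gap jfw describes can be seen by putting VSZ (total mapped virtual memory, roughly what pmap sums) next to RSS (the resident set, pages actually in physical RAM) for one process; `$$` here is just the current shell's pid, any pid works:]

```shell
# VSZ counts every mapping (shared libraries, file mappings,
# overcommitted allocations); RSS counts only pages resident in RAM,
# so VSZ is normally much larger. Both figures are in kilobytes.
ps -o pid,vsz,rss,comm -p $$
```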
~ 21 minutes ~
19:50 whaack jfw: no worries, thank you. Yes MIT did, but through my fault the material didn't stick with me. I'll read up on the subj more later, I'm about to head out to the airport.
19:50 ossabot Logged on 2019-10-13 10:00:22 whaack: yes it did, but i ~failed that course
19:53 jfw whaack: cool, no need to pile on further tsks then, lol
~ 1 hours 52 minutes ~
21:45 lobbes http://logs.ericbenevides.com/log/ossasepia/2020-02-27#1019473 << I missed this earlier, but archiving should already be occurring in this channel. Currently lobbesbot is set to silently snarf urls-to-parse from all channels it sits in, so this channel ought to be covered
21:45 ericbot Logged on 2020-02-27 13:03:46 diana_coman: lobbes: how does that link-archiving work, can I have it in here too or what does it require?